Conversation
Reviewer's GuideAdds Mellanox NIC temperature support to jtop by introducing MLNX_OFED-based temperature discovery and reading, extending the temperature core to handle numeric millidegree values alongside sysfs paths, wiring Mellanox sensors into TemperatureService, and providing documentation and verification/test scripts for the new behavior. Class diagram for updated temperature core with Mellanox supportclassDiagram
class TemperatureService {
- dict _temperature
+ TemperatureService()
+ dict get_status()
}
class TemperatureUtils {
+ dict read_temperature(dict data)
+ dict get_hwmon_thermal_system(str root_dir)
+ dict get_mellanox_temperature()
}
class ExternalTools {
<<utility>>
+ mget_temp
+ lspci
}
TemperatureService --> TemperatureUtils : uses
TemperatureUtils --> ExternalTools : calls
%% Details of behaviors
class read_temperature_details {
+ handles_path_values
+ handles_numeric_millidegree_values
+ returns_celsius_values
}
class get_mellanox_temperature_details {
+ detect_mlnx_ofed_via_mget_temp
+ discover_devices_with_lspci
+ read_temp_with_mget_temp
+ sudo_fallback_on_failure
+ store_temp_in_millidegrees
+ build_mlx_sensor_keys
}
TemperatureUtils .. read_temperature_details
TemperatureUtils .. get_mellanox_temperature_details
class get_status_details {
+ handle_path_based_sensors
+ handle_numeric_mlx_sensors
+ convert_mlx_millidegrees_to_celsius
+ set_default_max_crit_for_mlx
+ mark_sensor_online_offline
}
TemperatureService .. get_status_details
Flow diagram for Mellanox NIC temperature detection and readingflowchart TD
A_start["Start get_mellanox_temperature"] --> B_check_mget["Run which mget_temp"]
B_check_mget -->|mget_temp not found or error| Z_end_empty["Return empty temperature dict"]
B_check_mget -->|mget_temp found| C_lspci["Run lspci -d 15b3: -D to list Mellanox devices"]
C_lspci -->|no devices or error| Z_end_empty
C_lspci -->|devices found| D_for_each["For each lspci device line"]
D_for_each --> E_parse_line["Parse bus_addr and device_name"]
E_parse_line --> F_filter["Is ConnectX or MT device?"]
F_filter -->|no| D_for_each
F_filter -->|yes| G_try_mget["Run mget_temp -d bus_addr (no sudo)"]
G_try_mget -->|timeout or exception| D_for_each
G_try_mget -->|returncode != 0| H_try_sudo["Run sudo mget_temp -d bus_addr"]
G_try_mget -->|returncode == 0 and stdout| I_parse_temp["Parse temperature from stdout"]
H_try_sudo -->|timeout or exception| D_for_each
H_try_sudo -->|returncode != 0| D_for_each
H_try_sudo -->|returncode == 0 and stdout| I_parse_temp
I_parse_temp -->|parse error| D_for_each
I_parse_temp -->|success| J_store["Store temperature[mlx_bus_addr] = { temp: celsius * 1000 }"]
J_store --> D_for_each
D_for_each -->|no more lines| K_return["Return Mellanox temperature dict"]
K_return --> Z_end["End get_mellanox_temperature"]
Flow diagram for TemperatureService initialization and status with Mellanox sensorsflowchart TD
A_init["TemperatureService.__init__"] --> B_init_dict["Initialize _temperature as empty dict"]
B_init_dict --> C_virtual["Load virtual thermal zones"]
C_virtual --> D_hwmon["Load hwmon thermal sensors via get_hwmon_thermal_system"]
D_hwmon --> E_update_hwmon["Update _temperature with hwmon sensors"]
E_update_hwmon --> F_mlx["Call get_mellanox_temperature"]
F_mlx --> G_update_mlx["Update _temperature with Mellanox sensors"]
G_update_mlx --> H_check_empty["If _temperature is empty, log warning"]
H_check_empty --> I_sort["Sort sensors"]
I_sort --> J_init_done["Initialization complete"]
J_init_done --> K_get_status["TemperatureService.get_status"]
K_get_status --> L_iter["For each sensor in _temperature"]
L_iter --> M_is_mlx["sensor.get(temp) is numeric?"]
M_is_mlx -->|yes Mellanox| N_convert_mlx["Convert millidegrees to Celsius"]
N_convert_mlx --> O_defaults_mlx["Set max=84, crit=100"]
O_defaults_mlx --> P_online_mlx["Set online = temp != TEMPERATURE_OFFLINE"]
P_online_mlx --> Q_add_status_mlx["Add Mellanox sensor to status dict"]
M_is_mlx -->|no path-based| R_read_path["Call read_temperature for sensor"]
R_read_path --> S_online_path["Set online = temp != TEMPERATURE_OFFLINE"]
S_online_path --> T_add_status_path["Add path-based sensor to status dict"]
Q_add_status_mlx --> U_next["Next sensor"]
T_add_status_path --> U_next
U_next -->|more sensors| L_iter
U_next -->|no more sensors| V_return_status["Return full status dict"]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The new Mellanox integration relies on shelling out to
whichand tosudo mget_tempfrom within library code; consider usingshutil.whichfor detection and avoidingsudoinside the service (e.g., document required permissions or group membership instead), so the temperature service remains usable in non-interactive and constrained environments. - Several new top-level helper files (
verify_mellanox_fix.py,test_mellanox_temp.py,MELLANOX_FIX_SUMMARY.md,MELLANOX_TEMP_README.md,CHANGES_SUMMARY.md) are added; it may be cleaner to either move these under a dedicatedtools/ordocs/directory or keep only the minimal runtime-necessary pieces in the repo root to reduce clutter.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new Mellanox integration relies on shelling out to `which` and to `sudo mget_temp` from within library code; consider using `shutil.which` for detection and avoiding `sudo` inside the service (e.g., document required permissions or group membership instead), so the temperature service remains usable in non-interactive and constrained environments.
- Several new top-level helper files (`verify_mellanox_fix.py`, `test_mellanox_temp.py`, `MELLANOX_FIX_SUMMARY.md`, `MELLANOX_TEMP_README.md`, `CHANGES_SUMMARY.md`) are added; it may be cleaner to either move these under a dedicated `tools/` or `docs/` directory or keep only the minimal runtime-necessary pieces in the repo root to reduce clutter.
## Individual Comments
### Comment 1
<location> `jtop/core/temperature.py:151-160` </location>
<code_context>
+ temp_result = None
+ try:
+ # Try without sudo first
+ temp_result = subprocess.run(
+ ['mget_temp', '-d', bus_addr],
+ capture_output=True,
+ text=True,
+ timeout=2
+ )
+ if temp_result.returncode != 0:
+ # If failed without sudo, try with sudo
+ temp_result = subprocess.run(
+ ['sudo', 'mget_temp', '-d', bus_addr],
+ capture_output=True,
+ text=True,
</code_context>
<issue_to_address>
**🚨 issue (security):** Invoking `sudo mget_temp` from a library context is risky and may hang on password prompts.
This can hang in non-interactive contexts if passwordless sudo isn’t configured, and it silently escalates privileges from a read-only-looking API. Please either remove the sudo fallback and document that `mget_temp` must be directly runnable, or make sudo use explicitly opt-in via configuration and disabled by default.
</issue_to_address>
### Comment 2
<location> `jtop/core/temperature.py:134` </location>
<code_context>
+ # Find all Mellanox devices
+ try:
+ # Get list of Mellanox devices
+ devices_result = subprocess.run(['lspci', '-d', '15b3:', '-D'], capture_output=True, text=True)
+ if devices_result.returncode == 0 and devices_result.stdout.strip():
+ device_lines = devices_result.stdout.strip().split('\n')
</code_context>
<issue_to_address>
**suggestion (performance):** Running `lspci` on every service initialization may be expensive and lacks a timeout.
On some systems this subprocess can be relatively heavy, and without a timeout a hung `lspci` would block service startup. Consider adding a timeout to the call and/or caching the discovery so it only runs periodically instead of on every `TemperatureService` construction.
Suggested implementation:
```python
# Get list of Mellanox devices
try:
devices_result = subprocess.run(
['lspci', '-d', '15b3:', '-D'],
capture_output=True,
text=True,
timeout=5,
)
except subprocess.TimeoutExpired:
logger.warning("Timeout while running lspci to detect Mellanox devices")
devices_result = None
if devices_result and devices_result.returncode == 0 and devices_result.stdout.strip():
```
To fully address the suggestion about avoiding running lspci on every service initialization, consider:
1. Introducing a module-level cache (e.g., _MELLANOX_DEVICES_CACHE and _MELLANOX_DEVICES_CACHE_TS) and a TTL.
2. Wrapping the lspci execution in a helper function (e.g., _get_mellanox_devices()) that returns cached results if they are still fresh, and only re-runs lspci when the cache has expired.
</issue_to_address>
### Comment 3
<location> `jtop/core/temperature.py:173-182` </location>
<code_context>
+ temp_value = temp_result.stdout.strip()
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Parsing `mget_temp` output as a bare float may be brittle if the CLI adds labels or units.
This assumes `mget_temp` always returns a plain numeric string. If the CLI ever adds labels, units, or extra lines, `float(temp_result.stdout.strip())` will fail. Parsing just the first numeric token (e.g., via `split()` or a small regex) or restricting to the first line would make this more resilient to format changes.
Suggested implementation:
```python
if temp_result.returncode == 0 and temp_result.stdout.strip():
raw_output = temp_result.stdout.strip()
# Use only the first line and extract the first numeric token to be resilient to format changes
first_line = raw_output.splitlines()[0]
match = re.search(r'([-+]?(?:\d+(?:\.\d*)?|\.\d+))', first_line)
if not match:
logger.warning(f"Could not find numeric temperature in mget_temp output for {bus_addr}: {first_line!r}")
continue
temp_value_str = match.group(1)
try:
temp_celsius = float(temp_value_str)
# Create a virtual temperature file path for compatibility
sensor_key = f"mlx_{bus_addr.replace(':', '_').replace('.', '_')}"
temperature[sensor_key] = {
'temp': temp_celsius * 1000.0 # Store in millidegrees for consistency
}
logger.info(f"Found Mellanox NIC temperature: {device_name} = {temp_celsius}°C")
except ValueError:
logger.warning(f"Could not parse temperature from mget_temp for {bus_addr}: {temp_value_str!r}")
```
1. Add `import re` near the top of `jtop/core/temperature.py` alongside the other imports, for example:
<<<<<<< SEARCH
import logging
=======
import logging
import re
>>>>>>> REPLACE
Adjust the SEARCH line (`import logging`) to match an existing import in your file if it differs.
</issue_to_address>
### Comment 4
<location> `CHANGES_SUMMARY.md:17` </location>
<code_context>
+**Key Features:**
+- Detects MLNX_OFED installation by checking for `mget_temp`
+- Uses `lspci -d 15b3:` to find Mellanox devices
+- Calls `sudo mget_temp -d <device>` for each NIC
+- Stores temperatures in millidegrees for consistency
+- Comprehensive error handling and logging
</code_context>
<issue_to_address>
**issue:** Align this description of `mget_temp` invocation with the documented sudo fallback behavior.
This line still describes always calling `sudo mget_temp -d <device>`, but the implementation now runs `mget_temp` first and only falls back to sudo on failure. Please update this bullet to reflect the new fallback behavior so it matches the actual code and the rest of the documentation.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| temp_result = subprocess.run( | ||
| ['mget_temp', '-d', bus_addr], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=2 | ||
| ) | ||
| if temp_result.returncode != 0: | ||
| # If failed without sudo, try with sudo | ||
| temp_result = subprocess.run( | ||
| ['sudo', 'mget_temp', '-d', bus_addr], |
There was a problem hiding this comment.
🚨 issue (security): Invoking sudo mget_temp from a library context is risky and may hang on password prompts.
This can hang in non-interactive contexts if passwordless sudo isn’t configured, and it silently escalates privileges from a read-only-looking API. Please either remove the sudo fallback and document that mget_temp must be directly runnable, or make sudo use explicitly opt-in via configuration and disabled by default.
| # Find all Mellanox devices | ||
| try: | ||
| # Get list of Mellanox devices | ||
| devices_result = subprocess.run(['lspci', '-d', '15b3:', '-D'], capture_output=True, text=True) |
There was a problem hiding this comment.
suggestion (performance): Running lspci on every service initialization may be expensive and lacks a timeout.
On some systems this subprocess can be relatively heavy, and without a timeout a hung lspci would block service startup. Consider adding a timeout to the call and/or caching the discovery so it only runs periodically instead of on every TemperatureService construction.
Suggested implementation:
# Get list of Mellanox devices
try:
devices_result = subprocess.run(
['lspci', '-d', '15b3:', '-D'],
capture_output=True,
text=True,
timeout=5,
)
except subprocess.TimeoutExpired:
logger.warning("Timeout while running lspci to detect Mellanox devices")
devices_result = None
if devices_result and devices_result.returncode == 0 and devices_result.stdout.strip():To fully address the suggestion about avoiding running lspci on every service initialization, consider:
- Introducing a module-level cache (e.g., _MELLANOX_DEVICES_CACHE and _MELLANOX_DEVICES_CACHE_TS) and a TTL.
- Wrapping the lspci execution in a helper function (e.g., _get_mellanox_devices()) that returns cached results if they are still fresh, and only re-runs lspci when the cache has expired.
| temp_value = temp_result.stdout.strip() | ||
| try: | ||
| temp_celsius = float(temp_value) | ||
| # Create a virtual temperature file path for compatibility | ||
| sensor_key = f"mlx_{bus_addr.replace(':', '_').replace('.', '_')}" | ||
| temperature[sensor_key] = { | ||
| 'temp': temp_celsius * 1000.0 # Store in millidegrees for consistency | ||
| } | ||
| logger.info(f"Found Mellanox NIC temperature: {device_name} = {temp_celsius}°C") | ||
| except ValueError: |
There was a problem hiding this comment.
suggestion (bug_risk): Parsing mget_temp output as a bare float may be brittle if the CLI adds labels or units.
This assumes mget_temp always returns a plain numeric string. If the CLI ever adds labels, units, or extra lines, float(temp_result.stdout.strip()) will fail. Parsing just the first numeric token (e.g., via split() or a small regex) or restricting to the first line would make this more resilient to format changes.
Suggested implementation:
if temp_result.returncode == 0 and temp_result.stdout.strip():
raw_output = temp_result.stdout.strip()
# Use only the first line and extract the first numeric token to be resilient to format changes
first_line = raw_output.splitlines()[0]
match = re.search(r'([-+]?(?:\d+(?:\.\d*)?|\.\d+))', first_line)
if not match:
logger.warning(f"Could not find numeric temperature in mget_temp output for {bus_addr}: {first_line!r}")
continue
temp_value_str = match.group(1)
try:
temp_celsius = float(temp_value_str)
# Create a virtual temperature file path for compatibility
sensor_key = f"mlx_{bus_addr.replace(':', '_').replace('.', '_')}"
temperature[sensor_key] = {
'temp': temp_celsius * 1000.0 # Store in millidegrees for consistency
}
logger.info(f"Found Mellanox NIC temperature: {device_name} = {temp_celsius}°C")
except ValueError:
logger.warning(f"Could not parse temperature from mget_temp for {bus_addr}: {temp_value_str!r}")- Add
import renear the top ofjtop/core/temperature.pyalongside the other imports, for example:
<<<<<<< SEARCH
import logging
import logging
import re
REPLACE
Adjust the SEARCH line (import logging) to match an existing import in your file if it differs.
| **Key Features:** | ||
| - Detects MLNX_OFED installation by checking for `mget_temp` | ||
| - Uses `lspci -d 15b3:` to find Mellanox devices | ||
| - Calls `sudo mget_temp -d <device>` for each NIC |
There was a problem hiding this comment.
issue: Align this description of mget_temp invocation with the documented sudo fallback behavior.
This line still describes always calling sudo mget_temp -d <device>, but the implementation now runs mget_temp first and only falls back to sudo on failure. Please update this bullet to reflect the new fallback behavior so it matches the actual code and the rest of the documentation.
Summary by Sourcery
Add Mellanox NIC temperature monitoring support using MLNX_OFED utilities and integrate these readings into the existing temperature service.
New Features:
Enhancements:
Documentation:
Tests: