Commit beba194

Fix rules

1 parent f6b16a7 commit beba194

File tree: 1 file changed, +148 −1 lines


.rules/new_models_best_practice.mdc

Lines changed: 148 additions & 1 deletion
@@ -33,7 +33,154 @@ Each example should contain the following files:
- Reference the main repository README to help users understand how to install and run the ML backend.
- Include labeling configuration examples in the example README so users can quickly reproduce training and inference.
- Provide troubleshooting tips or links to Label Studio documentation such as [Writing your own ML backend](https://labelstud.io/guide/ml_create).
## 3.1. Security Best Practices

When implementing ML backends, follow these security guidelines:

- **Model Serialization**: Use secure serialization methods (e.g., load PyTorch `state_dict` checkpoints with `weights_only=True`, the default since PyTorch 2.6)
- **Input Validation**: Validate all user inputs, file formats, and data types before processing
- **Environment Variables**: Never hardcode sensitive information like API keys; use environment variables
- **File Access**: Restrict file system access to designated directories (`MODEL_DIR`, temp directories)
- **Dependencies**: Pin dependency versions and update them regularly for security patches

Example secure model loading:

```python
# Secure PyTorch model loading: weights_only=True refuses to unpickle
# arbitrary objects (the default behavior since PyTorch 2.6)
try:
    state_dict = torch.load(model_path, weights_only=True)
    model.load_state_dict(state_dict)
except Exception as e:
    logger.error(f"Failed to load model securely: {e}")
    # Fallback or error handling
```
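The input-validation and environment-variable bullets above can be sketched as follows. This is a minimal illustration, not part of any official API: the variable name `LABEL_STUDIO_API_KEY`, the helper `validate_task_path`, and the allowed-extension set are all assumptions.

```python
import os

# Read configuration from the environment instead of hardcoding it.
# LABEL_STUDIO_API_KEY is a hypothetical variable name for illustration.
MODEL_DIR = os.getenv("MODEL_DIR", "/data/models")
API_KEY = os.getenv("LABEL_STUDIO_API_KEY")  # None if unset; never a literal

ALLOWED_EXTENSIONS = {".csv", ".json", ".png", ".wav"}

def validate_task_path(path):
    """Reject unexpected formats and paths escaping the designated directory."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {ext}")
    full = os.path.realpath(os.path.join(MODEL_DIR, path))
    if not full.startswith(os.path.realpath(MODEL_DIR) + os.sep):
        raise ValueError(f"Path escapes {MODEL_DIR}: {path}")
    return full
```

Resolving with `os.path.realpath` before the prefix check is what catches `..` traversal and symlink tricks, not just obviously malformed names.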
## 3.2. Error Handling and Logging

Implement comprehensive error handling and logging:

- **Structured Logging**: Use consistent log levels and structured messages
- **Graceful Degradation**: Handle missing data, corrupted files, and network issues
- **User-Friendly Errors**: Return meaningful error messages to Label Studio
- **Debug Information**: Log sufficient detail for troubleshooting without exposing sensitive data

Example logging pattern:

```python
import logging

logger = logging.getLogger(__name__)

def predict(self, tasks):
    logger.info(f"Starting prediction for {len(tasks)} tasks")
    predictions = []
    for task in tasks:
        task_id = task.get("id")
        try:
            # Per-task prediction logic goes here
            logger.debug(f"Processed task {task_id} successfully")
        except Exception as e:
            # Log and skip the failed task instead of failing the whole batch
            logger.error(f"Prediction failed for task {task_id}: {e}")
    return predictions
```
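One way to set up the consistent log levels and format described above; this is a sketch, and the `LOG_LEVEL` variable name and backend logger name are assumptions:

```python
import logging
import os

# One consistent format for the whole backend; the level comes from the
# environment so dev and production can differ without code changes.
logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,  # replace any handlers configured earlier
)
logger = logging.getLogger("my_ml_backend")  # hypothetical backend name

logger.info("ML backend starting")
logger.debug("hidden unless LOG_LEVEL=DEBUG")
```

Passing the level through `basicConfig` once, rather than per-logger, keeps every module's output uniform.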
## 3.3. Performance and Scalability

Consider performance implications for production deployment:

- **Memory Management**: Use generators for large datasets, release unused variables, and watch for memory leaks
- **Batch Processing**: Process multiple tasks together when possible to improve throughput
- **Model Loading**: Cache models in memory to avoid repeated loading from disk
- **Resource Monitoring**: Log memory and CPU usage for monitoring and optimization
- **Async Operations**: Use async/await for I/O operations when appropriate

Example efficient model caching:

```python
# Module-level cache so each model is loaded from disk only once
_model_cache = {}

def get_model(self, model_path):
    if model_path not in _model_cache:
        logger.info(f"Loading model from {model_path}")
        _model_cache[model_path] = self._load_model(model_path)
    return _model_cache[model_path]
```
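The generator and batch-processing bullets above can be sketched like this; the batch size, the `batched` helper, and the batched `model` callable are illustrative assumptions:

```python
def batched(items, batch_size=16):
    """Yield successive fixed-size batches without materializing them all."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def predict_all(tasks, model):
    """Run inference one batch at a time instead of one task at a time."""
    results = []
    for batch in batched(tasks, batch_size=16):
        results.extend(model(batch))  # hypothetical batched model call
    return results
```

Because `batched` is a generator, only one batch of tasks is held alongside the model at any moment, which matters for large task lists.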
## 3.4. Model Versioning and Compatibility

Implement proper model versioning for production systems:

- **Version Tracking**: Include the model version in predictions and logs
- **Backwards Compatibility**: Handle multiple model versions gracefully
- **Migration Strategies**: Provide clear upgrade paths for model updates
- **Rollback Support**: Maintain the ability to revert to previous model versions

Example versioning pattern:

```python
def setup(self):
    self.set("model_version", f"{self.__class__.__name__}-v1.2.3")

def predict(self, tasks):
    predictions = []  # built by the prediction logic
    return ModelResponse(
        predictions=predictions,
        model_version=self.get("model_version"),
    )
```
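The rollback and backwards-compatibility points might look like this in practice. This is a sketch under assumed conventions: checkpoints stored as `model-<version>.pt` files, and a `resolve_model_path` helper that is not part of any official API.

```python
import os

def resolve_model_path(model_dir, requested_version=None):
    """Pick a requested model version, falling back to the newest available.

    Assumes checkpoints are saved as <model_dir>/model-<version>.pt;
    keeping older checkpoints on disk is what makes rollback possible.
    """
    versions = sorted(
        f for f in os.listdir(model_dir)
        if f.startswith("model-") and f.endswith(".pt")
    )
    if not versions:
        raise FileNotFoundError(f"No model checkpoints in {model_dir}")
    if requested_version:
        candidate = f"model-{requested_version}.pt"
        if candidate in versions:
            return os.path.join(model_dir, candidate)
    return os.path.join(model_dir, versions[-1])  # newest by version name
```

Requesting an old version is then an explicit rollback, while an unknown version degrades to the newest checkpoint rather than failing.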
## 3.5. CI/CD Integration

Design backends for automated testing and deployment:

- **Containerization**: Ensure Docker containers build consistently across environments
- **Test Automation**: Include comprehensive test suites that run in CI pipelines
- **Health Checks**: Implement `/health` endpoints for deployment monitoring
- **Configuration Management**: Use environment variables for all configuration
- **Dependency Management**: Pin all dependencies with specific versions

Example health check endpoint:

```python
@app.route('/health')
def health():
    return {"status": "healthy", "model_loaded": _model is not None}
```
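A minimal CI test for a health check like the one above can call the handler body directly, without starting a server. The names here are illustrative stand-ins, not the actual backend's code:

```python
# test_health.py: runnable in any CI pipeline, no running server needed.
_model = None  # would be set by setup() once the model is loaded

def health():
    """Stand-in for the body of a /health route handler."""
    return {"status": "healthy", "model_loaded": _model is not None}

def test_health_reports_model_state():
    response = health()
    assert response["status"] == "healthy"
    assert response["model_loaded"] is False  # no model loaded in this sketch
```

Testing the handler as a plain function keeps the CI job fast; an end-to-end check against the running container can then be a separate, smaller smoke test.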
## 3.6. Data Handling Patterns

Implement robust data processing for different scenarios:

- **File Format Support**: Handle multiple data formats (CSV, JSON, images, audio) with proper validation
- **Data Preprocessing**: Implement consistent preprocessing pipelines for training and prediction
- **Type Safety**: Use proper type conversion and validation for different data types
- **Streaming Data**: Support large files that don't fit in memory using streaming approaches
- **Data Caching**: Cache preprocessed data when appropriate to improve performance

Example robust data loading:

```python
def _read_data(self, task, path):
    """Load data with format detection and error handling."""
    try:
        if path.endswith('.csv'):
            csv_str = self.preload_task_data(task, value=path)
            return pd.read_csv(io.StringIO(csv_str))
        elif path.endswith('.json'):
            json_str = self.preload_task_data(task, value=path)
            return json.loads(json_str)
        else:
            raise ValueError(f"Unsupported file format: {path}")
    except Exception as e:
        logger.error(f"Failed to load data from {path}: {e}")
        return None
```

Common data validation pattern:

```python
def _validate_data(self, df, required_columns):
    """Validate DataFrame has required structure."""
    if df is None or df.empty:
        return False

    missing_cols = set(required_columns) - set(df.columns)
    if missing_cols:
        logger.error(f"Missing required columns: {missing_cols}")
        return False

    return True
```
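The streaming bullet above can be sketched with pandas' chunked CSV reader; the chunk size and the `stream_rows` helper name are illustrative assumptions:

```python
import io

import pandas as pd

def stream_rows(csv_source, chunksize=10_000):
    """Iterate over a CSV that may not fit in memory, one chunk at a time."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Only one chunk of rows is held in memory at any moment.
        yield from chunk.itertuples(index=False)
```

Passing `chunksize` makes `read_csv` return an iterator of DataFrames instead of one large frame, so preprocessing can run per chunk.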
## 4. Testing