chore: deleteme #450
Conversation
- Add utility functions for S3 configuration, URL generation, and file uploads.
- Enhance the ingestion flow to optionally upload digests to S3 if enabled.
- Modify API endpoints to redirect downloads to S3 if files are stored there.
- Extend the `IngestResponse` schema to include the S3 URL when applicable.
- Introduce a `get_current_commit_hash` utility to retrieve the commit SHA during ingestion.
- Add Docker Compose configuration for dev/prod environments with documented usage details.
- Integrate MinIO S3-compatible storage for local development, including bucket auto-setup and app credentials.
- Add an S3 storage toggle, a test service in Docker Compose, and the boto3 dependency.
- Enforce the UUID type for `ingest_id`; resolve comments.
- Implement `JSONFormatter` and methods for structured logging.
- Integrate logging into S3 client creation, uploads, and URL lookups.
- Enhance logging with extra fields for better traceability.
- Add optional S3 directory prefix support.
- Remove the unused test service from the Docker Compose configuration.
- Improve `get_s3_config` to handle optional environment variables more robustly.
- Add centralized JSON logging and integrate it into the S3 utilities.

Co-authored-by: Filip Christiansen <[email protected]>
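For orientation, here is a minimal sketch of what the S3 helpers and URL generation described above could look like. The function names (`get_s3_config`, `upload_digest_to_s3`, `generate_s3_url`), environment variable names, and bucket/key layout are assumptions for illustration, not the PR's actual implementation; only the use of boto3 is taken from the summary.

```python
# Sketch only: illustrative names and env vars, not the PR's real code.
import os
from pathlib import Path
from uuid import UUID

import boto3  # the PR adds boto3 as a dependency


def get_s3_config() -> dict:
    """Read optional S3 settings from the environment; missing values stay None."""
    return {
        "endpoint_url": os.environ.get("S3_ENDPOINT"),  # e.g. MinIO in local dev (assumed var name)
        "region_name": os.environ.get("S3_REGION"),
        "aws_access_key_id": os.environ.get("S3_ACCESS_KEY"),
        "aws_secret_access_key": os.environ.get("S3_SECRET_KEY"),
    }


def upload_digest_to_s3(local_path: Path, bucket: str, ingest_id: UUID, prefix: str = "") -> str:
    """Upload a digest file and return the object key it was stored under."""
    client = boto3.client("s3", **{k: v for k, v in get_s3_config().items() if v is not None})
    prefix_part = f"{prefix.rstrip('/')}/" if prefix else ""  # optional directory prefix
    key = f"{prefix_part}{ingest_id}/{local_path.name}"
    client.upload_file(str(local_path), bucket, key)
    return key


def generate_s3_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a presigned download URL that the API could redirect to."""
    client = boto3.client("s3", **{k: v for k, v in get_s3_config().items() if v is not None})
    return client.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires_in
    )
```

Presumably the ingestion flow calls something like `upload_digest_to_s3` after writing the digest locally and exposes the resulting URL on `IngestResponse`; the real wiring lives in the PR's diff.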
⚙️ Preview environment was undeployed.
```diff
     # Normalize and validate the directory path
-    directory = (TMP_BASE_PATH / ingest_id).resolve()
+    directory = (TMP_BASE_PATH / str(ingest_id)).resolve()
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To fix the problem, we should ensure that the `ingest_id` parameter is strictly validated as a UUID and that the constructed path is securely checked to prevent path traversal. Since FastAPI already enforces the type as `UUID`, we can further strengthen the check by:

- Using the UUID object directly (not its string representation) to construct the path, which avoids any unexpected characters.
- Ensuring that the resolved path is a subpath of the resolved base directory using robust path comparison (not just string prefix matching). The best way is to use the `Path.relative_to()` method, which will raise a `ValueError` if the path is not within the base directory.
- Optionally, we can add a check to ensure that the directory name matches the UUID format, though FastAPI should already enforce this.

The changes should be made in the `download_ingest` function in `src/server/routers/ingest.py`, specifically in the region where the path is constructed and validated (lines 154-160). No new imports are needed, as we can use the standard `pathlib` library.
```diff
@@ -157,3 +157,6 @@
 
-    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
+    try:
+        # Ensure the resolved directory is a subpath of TMP_BASE_PATH
+        directory.relative_to(TMP_BASE_PATH.resolve())
+    except ValueError:
         logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
         raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")
```
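As a standalone illustration of why the autofix prefers `Path.relative_to()` over a string prefix check, here is a small runnable sketch; the base path `/tmp/gitingest` and the helper name `is_within_base` are placeholders, not the project's actual `TMP_BASE_PATH` or code.

```python
from pathlib import Path

TMP_BASE_PATH = Path("/tmp/gitingest")  # placeholder base directory


def is_within_base(candidate: Path, base: Path = TMP_BASE_PATH) -> bool:
    """True only if candidate resolves to a location inside base."""
    try:
        candidate.resolve().relative_to(base.resolve())
        return True
    except ValueError:
        return False


# A sibling directory that merely shares the prefix slips past startswith():
evil = Path("/tmp/gitingest-evil/digest.txt")
print(str(evil.resolve()).startswith(str(TMP_BASE_PATH.resolve())))  # True (false positive)
print(is_within_base(evil))                                          # False

# A traversal attempt is rejected after normalization as well:
print(is_within_base(TMP_BASE_PATH / ".." / "etc" / "passwd"))       # False
```

The prefix check accepts a sibling directory that merely shares the prefix, which is exactly the class of bypass the `relative_to` approach rules out.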
```python
    if not directory.is_dir():
        logger.error(f"Digest directory not found - ingest_id: {ingest_id}, directory_path: {str(directory)}, directory_exists: {directory.exists()}, is_directory: {directory.is_dir() if directory.exists() else False}")
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To fix the issue, we will enhance the validation logic for the `directory` path to ensure it is safe and strictly contained within the `TMP_BASE_PATH`. This involves:

- Normalizing the path using `.resolve()` to remove any `..` segments.
- Validating that the resolved path starts with `TMP_BASE_PATH` and does not contain any unexpected characters or patterns.
- Adding a stricter check to ensure the `ingest_id` conforms to expected UUID formatting before constructing the path.

Changes required:

- Update the validation logic for `directory` to include stricter checks.
- Ensure the `ingest_id` is validated as a proper UUID before using it to construct the path.
```diff
@@ -153,2 +153,9 @@
     # Normalize and validate the directory path
+    try:
+        # Validate that ingest_id is a proper UUID
+        UUID(str(ingest_id))
+    except ValueError:
+        logger.error(f"Invalid ingest ID format - ingest_id: {ingest_id}")
+        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID format: {ingest_id!r}")
+
     directory = (TMP_BASE_PATH / str(ingest_id)).resolve()
@@ -157,9 +164,5 @@
 
-    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
-        logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
-        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")
-
-    if not directory.is_dir():
-        logger.error(f"Digest directory not found - ingest_id: {ingest_id}, directory_path: {str(directory)}, directory_exists: {directory.exists()}, is_directory: {directory.is_dir() if directory.exists() else False}")
-        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=f"Digest {ingest_id!r} not found")
+    if not directory.is_dir() or not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
+        logger.error(f"Invalid ingest ID or path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
+        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID or path traversal attempt: {ingest_id!r}")
 
```
```python
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")

    if not directory.is_dir():
        logger.error(f"Digest directory not found - ingest_id: {ingest_id}, directory_path: {str(directory)}, directory_exists: {directory.exists()}, is_directory: {directory.is_dir() if directory.exists() else False}")
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To fix the issue, we will add validation to ensure that the `ingest_id` parameter is a valid UUID and does not contain any unexpected characters or patterns. This will prevent malicious input from being used to construct unsafe paths. Additionally, we will ensure that the `directory` path is normalized and verified to be within the `TMP_BASE_PATH`.

Steps to fix:

- Validate that `ingest_id` is a valid UUID before using it to construct the `directory` path.
- Retain the existing normalization and prefix-checking logic for `directory`.
- Log an error and raise an HTTP 400 (Bad Request) exception if `ingest_id` is invalid.
```diff
@@ -106,3 +106,3 @@
 async def download_ingest(
-    ingest_id: UUID,
+    ingest_id: str,
 ) -> Union[RedirectResponse, FileResponse]:  # noqa: FA100 (future-rewritable-type-annotation) (pydantic)
@@ -153,3 +153,10 @@
     # Normalize and validate the directory path
-    directory = (TMP_BASE_PATH / str(ingest_id)).resolve()
+    try:
+        # Validate that ingest_id is a valid UUID
+        ingest_id_uuid = UUID(ingest_id)
+    except ValueError:
+        logger.error(f"Invalid ingest ID format - ingest_id: {ingest_id}")
+        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid ingest ID format: {ingest_id!r}")
+
+    directory = (TMP_BASE_PATH / str(ingest_id_uuid)).resolve()
 
```
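For comparison, a brief self-contained sketch of the two validation styles these suggestions debate: keeping the parameter annotated as `UUID` (so FastAPI rejects malformed values before the handler runs) versus widening it to `str` and parsing manually. The toy route and `parse_ingest_id` helper below are illustrative only, not the project's router, and assume `fastapi` plus `httpx` (for the test client) are installed.

```python
from uuid import UUID

from fastapi import FastAPI, HTTPException, status
from fastapi.testclient import TestClient

app = FastAPI()


@app.get("/download/{ingest_id}")
async def download(ingest_id: UUID) -> dict:
    # With the parameter typed as UUID, FastAPI returns 422 for anything that
    # is not a valid UUID, so the handler never sees path separators or "..".
    return {"ingest_id": str(ingest_id)}


def parse_ingest_id(raw: str) -> UUID:
    """Manual equivalent used when the parameter is declared as str."""
    try:
        return UUID(raw)
    except ValueError as exc:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=f"Invalid ingest ID format: {raw!r}",
        ) from exc


client = TestClient(app)
print(client.get("/download/123e4567-e89b-12d3-a456-426614174000").status_code)  # 200
print(client.get("/download/not-a-uuid").status_code)                            # 422
```

The typed parameter already gives the guarantee the manual check re-adds, which is why several of these suggestions note that FastAPI already enforces the type.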
```python
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")

    if not directory.is_dir():
        logger.error(f"Digest directory not found - ingest_id: {ingest_id}, directory_path: {str(directory)}, directory_exists: {directory.exists()}, is_directory: {directory.is_dir() if directory.exists() else False}")
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To fix the problem, we should strengthen the validation that ensures the constructed path is strictly within the intended base directory. Instead of using a string-based `startswith` check, we should use `pathlib.Path.relative_to` to confirm that the resolved path is a subpath of `TMP_BASE_PATH`. If `relative_to` raises a `ValueError`, the path is outside the base directory and should be rejected. This approach is robust against symlinks and other filesystem tricks. The fix should be applied in the region where the path validation occurs (lines 153–160). No new imports are needed, as `pathlib` is already in use via `TMP_BASE_PATH`.
```diff
@@ -157,3 +157,6 @@
 
-    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
+    try:
+        # Ensure the resolved directory is strictly within TMP_BASE_PATH
+        directory.relative_to(TMP_BASE_PATH.resolve())
+    except ValueError:
         logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
```
```python
        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=f"Digest {ingest_id!r} not found")

    try:
        # List all txt files for debugging
        txt_files = list(directory.glob("*.txt"))
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To fix the issue, we will enhance the validation of the `directory` path to ensure it is securely contained within the `TMP_BASE_PATH`. Specifically:

- Retain the use of `.resolve()` to normalize the path.
- Add a check to ensure that `directory` is a direct subdirectory of `TMP_BASE_PATH` by comparing its parent directory to `TMP_BASE_PATH`.
- Log any invalid paths and raise an appropriate HTTP exception if the validation fails.

This approach ensures that even if symbolic links or other edge cases are present, the `directory` path cannot escape the `TMP_BASE_PATH`.
```diff
@@ -157,3 +157,4 @@
 
-    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
+    # Validate that the directory is within TMP_BASE_PATH and is a direct subdirectory
+    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())) or directory.parent != TMP_BASE_PATH.resolve():
         logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
```
```python
    except StopIteration as exc:
        # List all files in directory for debugging
        all_files = list(directory.glob("*"))
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High) — user-provided value.
Copilot Autofix (AI) · 21 days ago
To address the issue, we will enhance the validation logic for the `directory` path. Specifically:

- Ensure that `ingest_id` is strictly validated as a UUID before constructing the path.
- Use `os.path.normpath` to normalize the path and verify that it is contained within `TMP_BASE_PATH`.
- Add explicit error handling for edge cases in path resolution.

These changes will ensure that the constructed path is safe and cannot be exploited for path traversal or other attacks.
```diff
@@ -153,9 +153,21 @@
     # Normalize and validate the directory path
-    directory = (TMP_BASE_PATH / str(ingest_id)).resolve()
-
-    logger.info(f"Local directory path resolved - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
-
-    if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
-        logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
-        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")
+    try:
+        # Validate ingest_id as a UUID
+        if not isinstance(ingest_id, UUID):
+            logger.error(f"Invalid ingest ID format - ingest_id: {ingest_id}")
+            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid ingest ID format: {ingest_id!r}")
+
+        # Construct and normalize the directory path
+        directory = TMP_BASE_PATH / str(ingest_id)
+        directory = directory.resolve(strict=False)  # Resolve without strict mode to handle non-existent paths
+
+        logger.info(f"Local directory path resolved - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
+
+        # Verify the normalized path is within TMP_BASE_PATH
+        if not str(directory).startswith(str(TMP_BASE_PATH.resolve())):
+            logger.error(f"Invalid ingest ID - path traversal attempt - ingest_id: {ingest_id}, directory_path: {str(directory)}, tmp_base_path: {str(TMP_BASE_PATH.resolve())}")
+            raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=f"Invalid ingest ID: {ingest_id!r}")
+    except Exception as exc:
+        logger.error(f"Error during path validation - ingest_id: {ingest_id}, error_type: {type(exc).__name__}, error_message: {str(exc)}")
+        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Internal server error during path validation")
 
```
f"Creating S3 client - endpoint: {log_config.get('endpoint_url', 'NOT_SET')}, " | ||
f"region: {log_config.get('region_name', 'NOT_SET')}, " | ||
f"has_access_key: {has_access_key}, has_secret_key: {has_secret_key}, " | ||
f"credentials_provided: {has_access_key and has_secret_key}" |
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High) — sensitive data (secret).
Copilot Autofix (AI) · 21 days ago
To fix the issue, we will remove the logging of sensitive metadata (`has_access_key` and `has_secret_key`) and replace it with a generic message that does not reveal any sensitive information. This ensures that no sensitive data or metadata is exposed in the logs. Specifically:

- Modify the logging statement on line 156 to exclude `has_access_key`, `has_secret_key`, and `credentials_provided`.
- Replace it with a generic message indicating that the S3 client is being created, without revealing sensitive details.

No new imports, methods, or definitions are required for this fix.
```diff
@@ -156,5 +156,3 @@
         f"Creating S3 client - endpoint: {log_config.get('endpoint_url', 'NOT_SET')}, "
-        f"region: {log_config.get('region_name', 'NOT_SET')}, "
-        f"has_access_key: {has_access_key}, has_secret_key: {has_secret_key}, "
-        f"credentials_provided: {has_access_key and has_secret_key}"
+        f"region: {log_config.get('region_name', 'NOT_SET')}"
     )
```
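In the same spirit as this autofix, here is a hedged sketch of keeping credential details (even their presence) out of structured logs entirely. The `JSONFormatter` below is a minimal stand-in for the one the PR describes, `redact_for_logging` and `SENSITIVE_KEYS` are invented names, and the MinIO-style credentials are dummy values.

```python
import json
import logging

SENSITIVE_KEYS = {"aws_access_key_id", "aws_secret_access_key", "session_token"}  # illustrative list


def redact_for_logging(config: dict) -> dict:
    """Return a copy of the config safe to log: sensitive keys are dropped entirely."""
    return {k: v for k, v in config.items() if k not in SENSITIVE_KEYS}


class JSONFormatter(logging.Formatter):
    """Minimal structured formatter in the spirit of the one the PR describes."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("s3_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

config = {
    "endpoint_url": "http://localhost:9000",   # dummy local MinIO endpoint
    "region_name": "us-east-1",
    "aws_access_key_id": "minioadmin",         # dummy credentials, never log these
    "aws_secret_access_key": "minioadmin",
}
logger.info("Creating S3 client", extra={"extra_fields": redact_for_logging(config)})
```

Dropping sensitive keys before the log call means the formatter never has a chance to leak them, regardless of log level or destination.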
oy copilot how about you stop being an arse
No description provided.