refactor(ingest/s3): enhance readability #12609

Draft, wants to merge 1 commit into base: master

Contributor

@eagle-25 eagle-25 commented Feb 12, 2025

  • Refactor S3Source().get_folder_info() to enhance readability
  • Add a test to ensure that get_folder_info() returns the expected result.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Feb 12, 2025

codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
...ngestion/src/datahub/ingestion/source/s3/source.py 86.58% <100.00%> (+0.05%) ⬆️

... and 52 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 96022f2...4946b94.

@@ -877,51 +876,28 @@ def _is_allowed_path(path_spec_: PathSpec, s3_uri: str) -> bool:
         s3_objects = (
             obj
             for obj in bucket.objects.filter(Prefix=prefix).page_size(PAGE_SIZE)
-            if _is_allowed_path(path_spec, f"s3://{obj.bucket_name}/{obj.key}")
+            if _is_allowed_path(path_spec, self.create_s3_path(obj.bucket_name, obj.key))
Contributor Author

ref

Comment on lines -901 to -905
-if modification_time is None:
-    logger.warning(
-        f"Unable to find any files in the folder {key}. Skipping..."
-    )
-    continue
Contributor Author

@eagle-25 eagle-25 Feb 12, 2025

Removed this check because groupby_unsorted never returns an empty group, so modification_time can never be None here.
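
To illustrate why an empty group cannot occur, here is a minimal sketch of an unsorted group-by built on a dictionary of lists. `group_unsorted` is a hypothetical stand-in written for this example, not DataHub's actual `groupby_unsorted`; it assumes the real helper buckets items in a similar way.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K", bound=Hashable)


def group_unsorted(
    items: Iterable[T], key: Callable[[T], K]
) -> Iterable[Tuple[K, List[T]]]:
    """Hypothetical stand-in: bucket items by key without sorting them first."""
    buckets: Dict[K, List[T]] = defaultdict(list)
    for item in items:
        # A bucket only comes into existence when its first item is appended,
        # so no key can ever map to an empty list.
        buckets[key(item)].append(item)
    return buckets.items()


# Group object keys by their top-level folder.
keys = ["a/1.csv", "b/2.csv", "a/3.csv"]
groups = dict(group_unsorted(keys, key=lambda k: k.split("/")[0]))
```

With no input objects the helper produces no groups at all, rather than a group with no members, which is why a per-group emptiness check is dead code under this assumption.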

Folder(
    partition_id=id,
    is_partition=bool(id),
    creation_time=creation_time if creation_time else None,  # type: ignore[arg-type]
Contributor Author

@eagle-25 eagle-25 Feb 12, 2025

We don't need null handling here because creation_time is always a datetime:

  1. creation_time is the value of item.last_modified. (L896)
  2. The type of item is ObjectSummary
  3. The type of ObjectSummary.last_modified is datetime.
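
As a small sketch of the argument above: if the value is always a `datetime`, the `x if x else None` fallback never takes the `None` branch, because `datetime` objects are always truthy. The function names `old_style` and `new_style` are hypothetical and exist only for this comparison.

```python
from datetime import datetime, timezone
from typing import Optional


def old_style(last_modified: datetime) -> Optional[datetime]:
    # Mirrors the removed expression: fall back to None for a falsy value.
    return last_modified if last_modified else None


def new_style(last_modified: datetime) -> datetime:
    # Passes the value through unchanged.
    return last_modified


# Edge cases included on purpose: the epoch, the smallest datetime, midnight.
samples = [
    datetime(1970, 1, 1, tzinfo=timezone.utc),
    datetime.min,
    datetime(2025, 2, 12, 0, 0, 0),
]
same = all(old_style(ts) is new_style(ts) for ts in samples)
```

This only holds as long as the attribute really is a `datetime`; if the SDK could ever hand back `None`, the fallback would still matter.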

@@ -847,7 +847,7 @@ def get_folder_info(
         path_spec: PathSpec,
         bucket: "Bucket",
         prefix: str,
-    ) -> List[Folder]:
+    ) -> Iterable[Folder]:
Contributor Author

@eagle-25 eagle-25 Feb 12, 2025

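Changed the return type to Iterable[Folder] for memory efficiency: folders are now yielded one at a time instead of being collected into a list first.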
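
A minimal sketch of the difference, using integers in place of `Folder` objects. The names `build`, `folders_as_list`, and `folders_lazily` are hypothetical and are not part of the PR.

```python
from typing import Iterator, List

produced: List[int] = []  # records which items have actually been built


def build(n: int) -> int:
    """Hypothetical stand-in for constructing one Folder."""
    produced.append(n)
    return n


def folders_as_list(count: int) -> List[int]:
    # Eager: every item exists in memory before the caller sees the first one.
    return [build(i) for i in range(count)]


def folders_lazily(count: int) -> Iterator[int]:
    # Lazy: each item is built only when the caller asks for it.
    for i in range(count):
        yield build(i)


lazy = folders_lazily(1000)
built_before_iteration = len(produced)  # nothing has run yet
first = next(lazy)
built_after_first = len(produced)  # exactly one item has been built
```

One trade-off to keep in mind: a caller that needs `len()`, indexing, or a second pass over the result has to wrap the call in `list(...)`, since a generator can be consumed only once.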

Comment on lines +886 to 899
max_file = max(group, key=lambda x: x.last_modified)
max_file_s3_path = self.create_s3_path(max_file.bucket_name, max_file.key)

# If partition_id is None, it means the folder is not a partition
partition_id = path_spec.get_partition_from_path(max_file_s3_path)

yield Folder(
    partition_id=partition_id,
    is_partition=bool(partition_id),
    creation_time=max_file.last_modified,
    modification_time=max_file.last_modified,
    sample_file=max_file_s3_path,
    size=sum(obj.size for obj in group),
)
Contributor Author

@eagle-25 eagle-25 Feb 12, 2025

Before: O(n)
After: O(2n)

The performance loss is minimal: each group is now traversed twice instead of once, and O(2n) is still linear, i.e. the same complexity class as O(n).

Contributor Author

Why O(2n)? Because max() and sum() each iterate over the group once.
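
For comparison, a sketch of the two-pass form next to a single-pass equivalent. `Obj`, `two_pass`, and `one_pass` are hypothetical names for this illustration; `Obj` only imitates the `size` and `last_modified` attributes used in the diff.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Obj:
    """Hypothetical stand-in for an S3 object summary."""

    key: str
    size: int
    last_modified: datetime


def two_pass(group: List[Obj]) -> Tuple[Obj, int]:
    newest = max(group, key=lambda o: o.last_modified)  # pass 1
    total = sum(o.size for o in group)  # pass 2
    return newest, total


def one_pass(group: List[Obj]) -> Tuple[Obj, int]:
    newest, total = group[0], 0
    for o in group:
        # Strict comparison keeps the first of several equally new objects,
        # which matches what max() does on ties.
        if o.last_modified > newest.last_modified:
            newest = o
        total += o.size
    return newest, total


group = [
    Obj("a/1.csv", 10, datetime(2025, 1, 1)),
    Obj("a/2.csv", 20, datetime(2025, 2, 1)),
    Obj("a/3.csv", 30, datetime(2025, 1, 15)),
]
newest, total = two_pass(group)
```

Both versions are linear in the group size, so the two-pass form trades a constant factor for readability. It does rely on the group being re-iterable (a list, for example): with a one-shot iterator, `max()` would exhaust it and `sum()` would return 0.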
