Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist emitTime from IngestionTrackingContext to the new entity tables and use emitTime during backfill #329

Merged
merged 6 commits into from
Dec 12, 2023

Conversation

derekpham
Copy link
Contributor

Context

This is continuation of #328
This PR makes 2 changes:

  • Persist emitTime to the new entity table if IngestionTrackingContext is provided
  • Use emitTime when checking whether we should backfill the entity
    If oldEmitTime is available and smaller than new event's emit time -> backfill
    If oldEmitTime is not available and last modified time smaller than new event's emit time -> backfill

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

…e emitTime field when during backfill logic if available
Copy link
Contributor

@jsdonn jsdonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm, small comments

*
* This will be null in the following scenarios:
* - The record is still on the old schema
* - The record inserted before we started persisting emitTime to the new schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: was?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"was still on the old schema"?
or "record was inserted"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latter makes more sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

*
* This will be null in the following scenarios:
* - The record is still on the old schema
* - The record inserted before we started persisting emitTime to the new schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I guess we won't backfill these values and just fall back on the naive approach if this val doesn't exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's the plan since most rows will be missing emitTime anyway. Based on the last discussion, we'll use lastmodifiedtime in place of emitTime for comparison.

@Nonnull Class<ASPECT> aspectClass,
@Nonnull AuditStamp auditStamp,
@Nullable Timestamp oldTimestamp,
@Nullable IngestionTrackingContext ingestionTrackingContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

messageId was never used before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. Intellij didn't have much trouble removing that param.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean we'll have a new messageId during the backfill? or the backfill's messageId is always null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the param was before I added the backfill feature. It was an unused param.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked offline with @yangyangv2 . For the method add, I will switch param messageId with the ingestionTrackingContext

@derekpham
Copy link
Contributor Author

@jsdonn should we also reset isBackfill back to false when emitting MAE?

@jsdonn
Copy link
Contributor

jsdonn commented Dec 5, 2023

@derekpham oh yeah that's a good idea. Even though it should be default to false (right?), definitely wouldn't hurt to add it explicitly there during regular MAE emission

@derekpham
Copy link
Contributor Author

@jsdonn It's default is false but for backfill MCE we should explicitly set it back to false?

@@ -555,6 +559,10 @@ private <ASPECT extends RecordTemplate> ASPECT_UNION unwrapAddResultToUnion(URN

private <ASPECT extends RecordTemplate> ASPECT unwrapAddResult(URN urn, AddResult<ASPECT> result, @Nonnull AuditStamp auditStamp,
@Nullable IngestionTrackingContext trackingContext) {
if (trackingContext != null) {
trackingContext.setBackfill(false); // reset backfill since MAE won't be a backfill event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in what scenario trackingContext will be null and we want to enforce it's setBackfill to be false?

Can we log such as enforcement activity to capture unexpected cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's null when someone calls ingest instead of ingestWithTracking. I can't think of a case where we don't want to set backfill to false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^ this isBackfill field in the ingestionTrackingContext is ONLY used for MCE backfills right? I believe Jinxin added a separate field for MAE backfill. If that's the case then yeah we always want to set it to false during MAE emission. But I hope that this is clear enough for ourselves and future devs...might be a little confusing to have 2 backfill related fields in an MAE.

@@ -37,6 +37,11 @@ record AuditedAspect {
* This value is different from lastmodifiedon / the timestamp in AuditStamp since auditStamp
* is created when the restli resource receives the ingestion request.
* This is set by the MCE producers (or MCE consumers if not set by producers)
*
* This will be null in the following scenarios:
* - The record is still on the old schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do you mean "The record is from the old schema"? to how understand "still on"?

@Nonnull Class<ASPECT> aspectClass,
@Nonnull AuditStamp auditStamp,
@Nullable Timestamp oldTimestamp,
@Nullable IngestionTrackingContext ingestionTrackingContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean we'll have a new messageId during the backfill? or the backfill's messageId is always null?

@@ -151,6 +156,9 @@ public <ASPECT extends RecordTemplate> int addWithOptimisticLocking(@Nonnull URN
.setLastmodifiedby(actor)
.setLastmodifiedon(new Timestamp(timestamp).toString())
.setCreatedfor(impersonator, SetMode.IGNORE_NULL);
if (ingestionTrackingContext != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be safer and more clear to still create IngestionTrackingContext in the input and here we check


} ```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your comment is missing some code?

&& trackingContext.getEmitTime() > oldAuditStamp.getTime());

log.info("Encounter backfill event. Tracking context: {}. Urn: {}. Aspect class: {}. Old audit stamp: {}. "
// ingestionTrackingContext if not null should always have emitTime. If emitTime doesn't exist within
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the expectation in IngestionTrackingContext? the model definition is optional for emitTime.

Can we make the logic/expectation to be consistent with the model definition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comment

Copy link
Contributor

@jsdonn jsdonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. One comment on having 2 backfill-related fields in the MAE but I guess we're kinda past the modeling phase.

Please get @yangyangv2 's approval for ship-it.

Copy link
Contributor

@yangyangv2 yangyangv2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last minor comment. The rest LGTM. Thanks for addressing all the comments!

@@ -93,8 +92,8 @@ public void ensureSchemaUpToDate() {
@Override
@Transactional
public <ASPECT extends RecordTemplate> int add(@Nonnull URN urn, @Nullable ASPECT newValue, @Nonnull Class<ASPECT> aspectClass,
@Nonnull AuditStamp auditStamp, @Nullable UUID messageId) {
return addWithOptimisticLocking(urn, newValue, aspectClass, auditStamp, null, null);
@Nonnull AuditStamp auditStamp, IngestionTrackingContext ingestionTrackingContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put a @nullable tag on ingestionTrackingContext.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@yangyangv2 yangyangv2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@yangyangv2 yangyangv2 merged commit 5874c4f into linkedin:master Dec 12, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants