TDL-23456 reports_email_activity stream PK update #63
Description of change
This PR updates the PK for `reports_email_activity`. The current PK may not be unique because of the `timestamp` field. Events in Mailchimp display with a timestamp like `2023-07-20T19:56:15.123`, but are returned from the API like `2023-07-20T19:56:15+00:00`. The milliseconds matter when two or more events occur within the same second.
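As a rough illustration (the PK field names below are made up for the example, not taken from the tap's schema), two events in the same second collapse to the same key once the millisecond portion is gone:

```sh
# Hypothetical PK strings for two events that happened within the same second.
# With millisecond precision (as displayed in the Mailchimp UI) they differ:
#   campaign-1|user-1|open|2023-07-20T19:56:15.123
#   campaign-1|user-1|open|2023-07-20T19:56:15.456
# With second precision (as returned by the API) they collide:
printf '%s\n' \
  'campaign-1|user-1|open|2023-07-20T19:56:15+00:00' \
  'campaign-1|user-1|open|2023-07-20T19:56:15+00:00' \
  | sort | uniq -c
#       2 campaign-1|user-1|open|2023-07-20T19:56:15+00:00
```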
Manual QA steps

See the methodology section for an explanation of what I'm doing here.
I ran the tap without this change and counted the unique PKs. The result: 91,689 PKs appear 1 time; 2,511 PKs appear 2 times; 610 PKs appear 3 times; and so on.
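In raw `sort | uniq -c` form the second-round output would look roughly like this (illustrative rendering: only the three counts quoted above are from the actual run, the remaining rows are elided):

```
  91689 1
   2511 2
    610 3
    ...
```

The first column is how many PKs fall into the bucket; the second is how many times each of those PKs appeared.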
Contrast that with a sync where I added a new timestamp field to the PK.
If you tally up the record count from the first sync, you get 95,500 records. I believe the second sync got more because data was added to Mailchimp between the end of the first sync and the start of the second one.
Risks
Rollback steps
Methodology
By chaining together two `sort | uniq -c` pipelines, we can count the number of times something appears. Given input like the following, where each line represents the primary key of a record:
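(The original sample isn't reproduced in this description; the block below is a made-up stand-in with a few repeated keys.)

```
a
a
a
b
b
c
d
e
```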
You can see at a glance how many times each value (`a` through `e`) appears.
One pass of `sort | uniq -c` gives those same counts (see the sketch below). But we want the count of the counts: if everything were unique, we would end up with a single "bucket" at the end holding the total number of records; if anything is non-unique, we would see more than one bucket.
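A sketch of that first pass over the made-up keys above (`printf` is just a stand-in for however the keys are actually produced):

```sh
printf '%s\n' a a a b b c d e | sort | uniq -c
#       3 a
#       2 b
#       1 c
#       1 d
#       1 e
```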
So, applying a second round of `sort | uniq -c` (sketched below) gives multiple buckets: therefore, we started with non-unique records.
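The full chain over the same made-up keys (the `awk '{print $1}'` keeps only the count column from the first pass; the exact command isn't quoted in the PR, so treat this as an approximation):

```sh
printf '%s\n' a a a b b c d e \
  | sort | uniq -c \
  | awk '{print $1}' \
  | sort | uniq -c
#       3 1    <- three keys appeared once
#       1 2    <- one key appeared twice
#       1 3    <- one key appeared three times
```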
Compare that with this set of unique input:
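(Again a made-up stand-in for the unique sample:)

```sh
printf '%s\n' a b c d e \
  | sort | uniq -c \
  | awk '{print $1}' \
  | sort | uniq -c
#       5 1    <- a single bucket: every key appeared exactly once
```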
Note: The use of `awk` in all of these pipelines is just to strip the leading whitespace after the first round of `sort | uniq -c`.