TDL-23456 reports_email_activity stream PK update #63

Open · wants to merge 6 commits into master
Conversation

luandy64 (Contributor)

Description of change

This PR updates the PK for reports_email_activity.

The current PK may not be unique because of the timestamp field's precision.

Events in the Mailchimp UI display with a millisecond timestamp like 2023-07-20T19:56:15.123, but the API returns them with only second precision, like 2023-07-20T19:56:15+00:00. The milliseconds matter when two or more events occur within the same second.
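
To make the collision concrete, here's a minimal sketch (the action/campaign/email values are made up, and this is not the tap's code) showing that two events within the same second are indistinguishable at the API's precision:

from datetime import datetime, timezone

# Two hypothetical events 666ms apart within the same second
event_a = datetime(2023, 7, 20, 19, 56, 15, 123000, tzinfo=timezone.utc)
event_b = datetime(2023, 7, 20, 19, 56, 15, 789000, tzinfo=timezone.utc)

def pk(event):
    # The stream's PK fields: action, campaign_id, email_id, timestamp
    return ("open", "campaign_1", "email_1", event.isoformat(timespec="seconds"))

print(pk(event_a) == pk(event_b))  # True -> two records share one PK

# With millisecond precision the timestamps stay distinct
print(event_a.isoformat(timespec="milliseconds"))  # 2023-07-20T19:56:15.123+00:00
print(event_b.isoformat(timespec="milliseconds"))  # 2023-07-20T19:56:15.789+00:00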

Manual QA steps

See the Methodology section below for an explanation of what I'm doing here.

I ran the tap without this change and counted the occurrences of each PK:

$ tap-mailchimp --config config --catalog catalog-for-sync-1 > sync1
$ grep 'RECORD' sync1 \
    | jq -Sc '.record | {action, campaign_id, email_id, timestamp}' \
    | sort \
    | uniq -c \
    | awk '{print $1}' \
    | sort \
    | uniq -c

  91689 1
   2511 2
    610 3
    440 4
    178 5
     44 6
     22 7
      5 8
      1 9

The interpretation: 91,689 PKs appear once; 2,511 PKs appear twice; 610 PKs appear three times; and so on.
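
To sanity-check those numbers, we can total the histogram (a quick sketch, using the counts printed above):

# {times_a_PK_appeared: number_of_PKs}, copied from the sync1 output
hist = {1: 91689, 2: 2511, 3: 610, 4: 440, 5: 178, 6: 44, 7: 22, 8: 5, 9: 1}

print(sum(hist.values()))                    # 95500 distinct PKs
print(sum(k * v for k, v in hist.items()))   # 101658 records emitted in total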

Contrast that with a sync where I added the new _sdc_stitch_timestamp field to the PK.

$ tap-mailchimp --config config --catalog catalog-for-sync-2 > sync2
$ grep 'RECORD' sync2 \
    | jq -Sc '.record | {_sdc_stitch_timestamp, action, campaign_id, email_id, timestamp}' \
    | sort \
    | uniq -c \
    | awk '{print $1}' \
    | sort \
    | uniq -c

 101389 1

If you tally up the first column from the first sync, you get 95,500 distinct PKs. The second sync emitted 101,389 records, each with a unique PK. I believe the second sync got more because data was added to Mailchimp between the end of the first sync and the start of the second one.

Risks

  • This solution is not bulletproof. Two events can technically still collide at 6 decimal places of seconds, in which case data would still be missed.

Rollback steps

  • Revert this branch and bump the version.

Methodology

By chaining together two sort | uniq -c pipelines, we can count the number of times something appears, and then count how many things appear that many times.

Given input like this, representing the primary keys of records:

$ cat records
a
b
b
c
c
d
d
d
e
e
e

You can see that there are

  • 1 a
  • 2 b
  • 2 c
  • 3 d
  • 3 e

One pass of sort | uniq -c gives the same counts:

$ cat records | sort | uniq -c
      1 a
      2 b
      2 c
      3 d
      3 e

But we want the count of the counts, because if every record were unique, we would end up with a single "bucket" holding the total number of records. If there were duplicates, we would see more than one bucket.

So applying a second round of sort | uniq -c:

$ cat records | sort | uniq -c | awk '{print $1}' | sort | uniq -c
      1 1
      2 2
      2 3

Multiple buckets: therefore, we started with non-unique records.

Compare that with this set of unique input:

$ cat records2
a
b
c
d
e

$ cat records2 | sort | uniq -c
      1 a
      1 b
      1 c
      1 d
      1 e

$ cat records2 | sort | uniq -c | awk '{print $1}' | sort | uniq -c
      5 1

Note: The use of awk in all of these pipelines is to extract just the count column (the first field), which also strips the leading whitespace, so that the second round of sort | uniq -c groups identical counts rather than whole "count key" lines.
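
The same count-of-counts can be computed in one shot with Python's collections.Counter (a sketch equivalent to the shell pipeline, reading the same records file):

from collections import Counter

with open("records") as f:
    keys = [line.strip() for line in f if line.strip()]

per_key = Counter(keys)              # the first sort | uniq -c: {'a': 1, 'b': 2, ...}
buckets = Counter(per_key.values())  # the second round: the count of the counts
print(buckets)                       # Counter({2: 2, 3: 2, 1: 1})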
