TDL-23456 reports_email_activity stream PK update #63

Open · wants to merge 6 commits into master
Conversation

luandy64 (Contributor)

Description of change

This PR updates the PK for reports_email_activity.

The current PK may not be unique because of the timestamp field's precision.

Events in the Mailchimp UI display with a millisecond timestamp like 2023-07-20T19:56:15.123, but the API returns them with only second precision, like 2023-07-20T19:56:15+00:00. The milliseconds matter when two or more events occur within the same second.
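
To make the collision concrete, here's a minimal sketch (the action/campaign/email values are made up, and this is not the tap's code) showing that two events within the same second are indistinguishable at the API's precision:

from datetime import datetime, timezone

# Two hypothetical events 666ms apart within the same second
event_a = datetime(2023, 7, 20, 19, 56, 15, 123000, tzinfo=timezone.utc)
event_b = datetime(2023, 7, 20, 19, 56, 15, 789000, tzinfo=timezone.utc)

def pk(event):
    # The stream's PK fields: action, campaign_id, email_id, timestamp
    return ("open", "campaign_1", "email_1", event.isoformat(timespec="seconds"))

print(pk(event_a) == pk(event_b))  # True -> two records share one PK

# With millisecond precision the timestamps stay distinct
print(event_a.isoformat(timespec="milliseconds"))  # 2023-07-20T19:56:15.123+00:00
print(event_b.isoformat(timespec="milliseconds"))  # 2023-07-20T19:56:15.789+00:00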

Manual QA steps

See the Methodology section below for an explanation of what I'm doing here.

I ran the tap without this change and counted the occurrences of each PK:

$ tap-mailchimp --config config --catalog catalog-for-sync-1 > sync1
$ grep 'RECORD' sync1 \
    | jq -Sc '.record | {action, campaign_id, email_id, timestamp}' \
    | sort \
    | uniq -c \
    | awk '{print $1}' \
    | sort \
    | uniq -c

  91689 1
   2511 2
    610 3
    440 4
    178 5
     44 6
     22 7
      5 8
      1 9

The interpretation: 91,689 PKs appear once; 2,511 PKs appear twice; 610 PKs appear three times; and so on.
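
To sanity-check those numbers, we can total the histogram (a quick sketch, using the counts printed above):

# {times_a_PK_appeared: number_of_PKs}, copied from the sync1 output
hist = {1: 91689, 2: 2511, 3: 610, 4: 440, 5: 178, 6: 44, 7: 22, 8: 5, 9: 1}

print(sum(hist.values()))                    # 95500 distinct PKs
print(sum(k * v for k, v in hist.items()))   # 101658 records emitted in total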

Contrast that with a sync where I added the new _sdc_stitch_timestamp field to the PK.

$ tap-mailchimp --config config --catalog catalog-for-sync-2 > sync2
$ grep 'RECORD' sync2 \
    | jq -Sc '.record | {_sdc_stitch_timestamp, action, campaign_id, email_id, timestamp}' \
    | sort \
    | uniq -c \
    | awk '{print $1}' \
    | sort \
    | uniq -c

 101389 1

If you tally up the first column from the first sync, you get 95,500 distinct PKs. The second sync emitted 101,389 records, each with a unique PK. I believe the second sync got more because data was added to Mailchimp between the end of the first sync and the start of the second one.

Risks

  • This solution is not bulletproof. Two events can technically still collide at 6 decimal places of seconds, in which case data would still be missed.

Rollback steps

  • Revert this branch and bump the version.

Methodology

By chaining together two sort | uniq -c pipelines, we can count the number of times something appears, and then count how many things appear that many times.

Given input like this, representing the primary keys of records:

$ cat records
a
b
b
c
c
d
d
d
e
e
e

You can see that there are

  • 1 a
  • 2 b
  • 2 c
  • 3 d
  • 3 e

One pass of sort | uniq -c gives the same counts:

$ cat records | sort | uniq -c
      1 a
      2 b
      2 c
      3 d
      3 e

But we want the count of the counts, because if every record were unique, we would end up with a single "bucket" holding the total number of records. If there were duplicates, we would see more than one bucket.

So applying a second round of sort | uniq -c:

$ cat records | sort | uniq -c | awk '{print $1}' | sort | uniq -c
      1 1
      2 2
      2 3

Multiple buckets: therefore, we started with non-unique records.

Compare that with this set of unique input:

$ cat records2
a
b
c
d
e

$ cat records2 | sort | uniq -c
      1 a
      1 b
      1 c
      1 d
      1 e

$ cat records2 | sort | uniq -c | awk '{print $1}' | sort | uniq -c
      5 1

Note: The use of awk in all of these pipelines is to extract just the count column (the first field), which also strips the leading whitespace, so that the second round of sort | uniq -c groups identical counts rather than whole "count key" lines.
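
The same count-of-counts can be computed in one shot with Python's collections.Counter (a sketch equivalent to the shell pipeline, reading the same records file):

from collections import Counter

with open("records") as f:
    keys = [line.strip() for line in f if line.strip()]

per_key = Counter(keys)              # the first sort | uniq -c: {'a': 1, 'b': 2, ...}
buckets = Counter(per_key.values())  # the second round: the count of the counts
print(buckets)                       # Counter({2: 2, 3: 2, 1: 1})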
