
Commit 170eadb

Json pipelines docs (#1165)
* Json pipelines docs
* address comment
* update links
* delete metadata
* add meta
* update s3 doc
* s3 page
* itmes
* update bigquery page
* update redshift docs
* image
* add Region to glue
* update creating database
* add file
* update privilege
* update
* update redshift
* nit
* snowflake docs
* remove redshift cluster info
* address comments
* restructre
* rename path
* add instruction to create pipelines
* add privilege in redshift
* update events path
* update links
* update events path
* fix links
* fix syntax
* add snowflake
* change callout syntax
* fix
* fix
* test redirects
* mdx
* fix redirects
* test link
* nit
1 parent fa270ab commit 170eadb

29 files changed: +1198 -173 lines changed

pages/docs/data-pipelines/_meta.json

Lines changed: 4 additions & 3 deletions

@@ -1,5 +1,6 @@
   {
-    "overview": "Overview",
-    "schematized-export-pipeline": "Schematized Export Pipeline",
-    "integrations": "Integrations"
+    "overview": "Overview",
+    "json-pipelines": "Data Pipelines",
+    "integrations": "Integrations",
+    "old-pipelines": "Older Version"
   }

pages/docs/data-pipelines/integrations.mdx

Lines changed: 0 additions & 28 deletions
This file was deleted.
Lines changed: 6 additions & 8 deletions

@@ -1,10 +1,8 @@
   {
-    "raw-aws-pipeline": "Raw AWS Pipeline",
-    "raw-azure-pipeline": "Raw Azure Pipeline",
-    "raw-gcs-pipeline": "Raw GCS Pipeline",
-    "schematized-bigquery-pipeline": "Schematized BigQuery Pipeline",
-    "schematized-aws-pipeline": "Schematized AWS Pipeline",
-    "schematized-azure-pipeline": "Schematized Azure Pipeline",
-    "schematized-gcs-pipeline": "Schematized GCS Pipeline",
-    "schematized-snowflake-pipeline": "Schematized Snowflake Pipeline"
+    "aws-s3": "AWS S3",
+    "azure-blob-storage": "Azure Blob Storage",
+    "bigquery": "BigQuery",
+    "gcp-gcs": "Google Cloud Storage",
+    "redshift-spectrum": "Redshift Spectrum",
+    "snowflake": "Snowflake"
   }
Lines changed: 127 additions & 0 deletions
# AWS S3

Mixpanel allows you to export events and people data into AWS S3 through [JSON Pipelines](/docs/data-pipelines/overview).

## Setting S3 Permissions

Before configuring permissions, note that you must first create the AWS S3 bucket you want to export to.

Mixpanel supports various configurations to securely manage your data on AWS S3. For resource access, Mixpanel uses AWS cross-account roles. This section details the permissions Mixpanel requires based on your S3 bucket configuration.

### Step 1: Create Data Modification Policy

To export data from Mixpanel to AWS S3, grant the following data modification permissions. Use the policy below, replacing `<BUCKET_NAME>` with the name of your bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MixpanelS3AccessStatement",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::<BUCKET_NAME>", "arn:aws:s3:::<BUCKET_NAME>/*"]
    }
  ]
}
```

### Step 2: Server-Side Encryption (optional)

Mixpanel transfers data to your S3 bucket over a TLS-encrypted connection. To secure your data at rest in S3, enable [Server-Side Encryption (SSE)](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html), which offers two options: **Encryption with Amazon S3-Managed Keys (SSE-S3)** and **Encryption with AWS KMS-Managed Keys (SSE-KMS)**.

#### Encryption with Amazon S3-Managed Keys (SSE-S3)

This option encrypts your data at rest using the AES-256 algorithm, with keys managed by S3. To enable it, select `AES` from the **Encryption** dropdown menu when creating the pipeline.

#### Encryption with AWS KMS-Managed Keys (SSE-KMS)

For encryption with AWS KMS, you can use either the default `aws/s3` key or your own custom key.

- Using the Default Key

  Simply select `KMS` from the **Encryption** dropdown menu and leave the `KMS Key ID` field empty when creating your pipeline.

- Using a Custom Key

  1. Select `KMS` from the **Encryption** dropdown menu and enter your custom key's ARN in the `KMS Key ID` field.

  2. Create an IAM policy allowing Mixpanel to use your KMS key, as shown in the JSON snippet below. Replace `<KEY_ARN>` with your key's ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MixpanelKmsStatement",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey",
        "kms:ReEncryptTo",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:DescribeKey",
        "kms:ReEncryptFrom"
      ],
      "Resource": "<KEY_ARN>"
    }
  ]
}
```

### Step 3: Create Access Role

After establishing the necessary policies, create a cross-account IAM role and attach the policies you've created:

- Go to the **IAM** service in the AWS console.
- Select **Roles** in the sidebar and click **Create role**.
- On the trusted entity page, choose **AWS Account**, then click **Another AWS account**, enter `485438090326` for the **Account ID**, and click **Next**.
- On the permissions page, locate and attach the policies you created in the previous steps (data modification and, if applicable, KMS).
- On the review page, provide a name and description for this role and click **Create role**.

To ensure secure operations, limit the trust relationship to the Mixpanel export user:

- Return to the **IAM** service, select **Roles**, and locate the role you just created.
- In the **Trust relationships** tab, click **Edit trust policy**.
- Update the trust relationship with the following JSON, replacing `<MIXPANEL_PROJECT_TOKEN>` with your Mixpanel project token.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::485438090326:user/mixpanel-export"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<MIXPANEL_PROJECT_TOKEN>"
        }
      }
    }
  ]
}
```

- Click **Update policy** and save.

This setup uses an external ID to prevent [the confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), enhancing the security of cross-account access, as Mixpanel interacts with AWS using your project token.
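
The same setup can also be scripted with the AWS CLI. This is a minimal sketch, assuming the data modification policy and trust policy shown above are saved locally as `mixpanel-s3-policy.json` and `mixpanel-trust-policy.json` (hypothetical file names), and that the policy and role names are placeholders you can rename:

```shell
# Create the data modification policy from the JSON document in Step 1
aws iam create-policy \
  --policy-name mixpanel-s3-access \
  --policy-document file://mixpanel-s3-policy.json

# Create the cross-account role using the trust policy above;
# the external ID inside it should be your Mixpanel project token
aws iam create-role \
  --role-name mixpanel-export-role \
  --assume-role-policy-document file://mixpanel-trust-policy.json

# Attach the data modification policy to the role
# (replace <ACCOUNT_ID> with your AWS account ID)
aws iam attach-role-policy \
  --role-name mixpanel-export-role \
  --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/mixpanel-s3-access
```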

### Step 4: Provide Mixpanel with S3 Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. Provide the following details so that Mixpanel can accurately direct the data exports to your S3 bucket:

- **Bucket**: The S3 bucket where Mixpanel data should be exported.
- **Region**: The AWS region where your S3 bucket is located.
- **Role**: The AWS role ARN that Mixpanel should assume when writing to your S3 bucket, e.g., `arn:aws:iam::<ACCOUNT_ID>:role/example-s3-role`.
- **Encryption (optional)**: The type of at-rest encryption used by the S3 bucket.
- **KMS Key ID (optional)**: If using KMS encryption, the custom key ID that you wish to use.
Lines changed: 37 additions & 0 deletions
# Azure Blob Storage

Mixpanel allows you to export events and people data directly into an Azure Blob Storage instance through [JSON Pipelines](/docs/data-pipelines/overview).

## Setting Blob Storage Permissions

To enable Mixpanel to write data to your Azure Blob Storage, specific permissions need to be set up, because Azure authentication mechanisms do not support cross-account access. You will need to provide Mixpanel with Azure credentials linked to your Blob Storage container.

### Step 1: Create a Service Principal

Start by creating a **_Service Principal_** in your Azure Active Directory. This can be done using the Azure CLI with the command below (output shown with `"redacted"` values). The command generates credentials in JSON format; handle the output securely, as it contains sensitive information.

```shell
$ az ad sp create-for-rbac --sdk-auth
{
  "clientId": "redacted",
  "clientSecret": "redacted",
  "subscriptionId": "redacted",
  "tenantId": "redacted",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}
```

### Step 2: Assign Role to Service Principal

Next, navigate to the Blob Storage container you wish to use and assign the `Storage Blob Data Contributor` role to the newly created Service Principal.
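
The assignment can also be made from the Azure CLI. A minimal sketch, assuming `<CLIENT_ID>` is the `clientId` returned in Step 1 and the subscription, resource group, storage account, and container names are placeholders for your own resources:

```shell
# Grant the Service Principal "Storage Blob Data Contributor" on the target container
az role assignment create \
  --assignee <CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>/blobServices/default/containers/<CONTAINER_NAME>"
```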

### Step 3: Provide Mixpanel with Access Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide specific details to enable authentication and data export to Azure Blob Storage. For authentication, supply the `Client Id`, `Client Secret`, and `Tenant Id`. These credentials allow Mixpanel to operate as the Service Principal and authenticate securely without exposing broader Azure resources.

Additionally, to define the export destination, provide the `Storage Account` and `Container Name`. These details identify the exact location within Azure where your data will be exported.
Lines changed: 116 additions & 0 deletions
# BigQuery

This guide describes how Mixpanel exports your data into a customer-managed [Google BigQuery](https://cloud.google.com/bigquery/) dataset.

## Design

![image](/230698685-c02cb9a1-d66f-42a7-8063-8e78b79e7b1f.png)

For events data, we create a single table called `mp_master_event` and store all external properties inside the `properties` column in JSON type. You can extract properties using JSON functions; see [Query Data](#query-data) for more details.

For user profiles and identity mappings, we create new tables `mp_people_data_*` and `mp_identity_mappings_data_*` with a random suffix every time, and then update the views `mp_people_data_view` and `mp_identity_mappings_data_view` to point at the latest table. Always use the views instead of the underlying tables: old tables are not deleted immediately, so querying them directly may return outdated data.
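
As a quick illustration (a sketch using the `bq` command-line tool; the project and dataset names are placeholders), you can query the view directly rather than any suffixed table:

```shell
# Query the people view; the underlying mp_people_data_* tables may be stale
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS people_count FROM `<your-gcp-project>.<your-dataset>.mp_people_data_view`'
```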

Export logs are maintained in the `mp_nessie_export_log` table within BigQuery. This table provides details such as export times, date ranges (from date and to date), and the number of rows exported, allowing you to monitor and audit the export process.

> **Important:** Please do not modify the schema of tables generated by Mixpanel. Altering the table schema can cause the pipeline to fail to export due to schema mismatches.

## Setting BigQuery Permissions

Follow these steps to share permissions with Mixpanel and create JSON pipelines.

### Step 1: Create a Dataset

Create a dataset in BigQuery to store the Mixpanel data.

![image](/230698727-1216833e-8321-46de-a388-8b554a00938c.png)

### Step 2: Grant Permissions to Mixpanel

> **Note:** If your organization uses a [domain restriction constraint](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains), you will have to update the policy to allow the Mixpanel domain `mixpanel.com` and Google Workspace customer ID `C00m5wrjz`.

Mixpanel requires two permissions to manage the dataset:

**BigQuery Job User**

- Navigate to **IAM & Admin** in your Google Cloud Console.
- Click **+ ADD** to add principals.
- Add the new principal `[email protected]` and set the role to `BigQuery Job User`.
- Click the **Save** button.

![image](/230698732-4dadbccf-1eeb-4e64-a6c7-8926eb49e5cc.png)

**BigQuery Data Owner**

- Go to **BigQuery** in your Google Cloud Console.
- Open the dataset intended for Mixpanel exports.
- Click **Sharing** and select **Permissions** from the dropdown.
- In the Data Permissions window, click **Add Principal**.
- Add the new principal `[email protected]`, set the role to `BigQuery Data Owner`, and save.

![image](/230698735-972aedb5-1352-4ebc-82c4-ef075679779b.png)
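
If you prefer to script the project-level grant, here is a minimal `gcloud` sketch; `<PROJECT_ID>` and `<DATASET_NAME>` are placeholders, and `<MIXPANEL_EXPORT_ACCOUNT>` stands for the Mixpanel principal email shown in the steps above. The dataset-level Data Owner grant can still be applied through the Sharing dialog described above.

```shell
# Grant project-level BigQuery Job User to the Mixpanel export principal
# (use the principal email from the console steps above; adjust the member
# prefix, e.g. serviceAccount: or user:, to match the account type)
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<MIXPANEL_EXPORT_ACCOUNT>" \
  --role="roles/bigquery.jobUser"

# Verify the dataset's access entries after sharing it in the console
bq show --format=prettyjson <PROJECT_ID>:<DATASET_NAME>
```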

### Step 3: Provide Necessary Details for Pipeline Creation

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide the following details to enable authentication and data export to BigQuery:

- **GCP project ID**: The project ID where the BigQuery dataset is present
- **Dataset name**: The dataset created on the GCP project to which Mixpanel needs to export data
- **GCP region**: The region used for BigQuery

## Partitioning

Data in the events table `mp_master_event` is partitioned based on the [`_PARTITIONTIME` pseudo column](https://cloud.google.com/bigquery/docs/querying-partitioned-tables#ingestion-time_partitioned_table_pseudo_columns), in the project timezone.

> **Note:** `TIMEPARTITIONING` should not be updated on the table; doing so will cause your export jobs to fail. Create a new table or view from this table for custom partitioning.

## Query Data

This section provides examples of how to query data exported to BigQuery. Refer to the [BigQuery docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_value) for more details about using JSON functions to query properties.

### Get the Number of Events Each Day

To verify the completeness of the export process, use the following SQL query to count events per day:

```sql
SELECT
  _PARTITIONTIME AS pt,
  COUNT(*)
FROM
  `<your gcp project>.<your dataset>.mp_master_event`
WHERE
  DATE(_PARTITIONTIME) <= "2024-05-31"
  AND DATE(_PARTITIONTIME) >= "2024-05-01"
GROUP BY
  pt
```

### Query Identity Mappings

When querying the identity mappings table, prefer the `resolved_distinct_id` over the non-resolved `distinct_id` whenever it is available. If a `resolved_distinct_id` is not available, fall back to the `distinct_id` from the people or events table.

Below is an example query that uses the identity mappings table. It counts the number of events for each unique user in San Francisco within a specific date range.

```sql
SELECT
  CASE
    WHEN mappings.resolved_distinct_id IS NOT NULL THEN mappings.resolved_distinct_id
    WHEN mappings.resolved_distinct_id IS NULL THEN events.distinct_id
  END AS resolved_distinct_id,
  COUNT(*) AS count
FROM
  `<your gcp project>.<your dataset>.mp_master_event` events
INNER JOIN
  `<your gcp project>.<your dataset>.mp_identity_mappings_data_view` mappings
ON
  events.distinct_id = mappings.distinct_id
  AND JSON_VALUE(properties, '$."$city"') = "San Francisco"
  AND DATE(events._PARTITIONTIME) <= "2024-05-31"
  AND DATE(events._PARTITIONTIME) >= "2024-05-01"
GROUP BY
  resolved_distinct_id
LIMIT
  100
```

This query demonstrates how to use conditional logic and JSON functions in BigQuery to analyze user behavior by geographic location. Additional filters on event properties can be added to refine the analysis and gain more detailed insight into specific user actions or behaviors.
Lines changed: 27 additions & 0 deletions
# Google Cloud Storage

Mixpanel supports exporting events and people data directly to Google Cloud Storage (GCS) via [JSON Pipelines](/docs/data-pipelines/overview).

## Setting GCS Permissions

To facilitate data export to Google Cloud Storage, permissions must be configured to allow Mixpanel access to your GCS bucket.

### Step 1: Assign Roles to Service Account on Bucket

You must grant the `Storage Object Admin` role to the service account `[email protected]` for the bucket you are creating or intend to reuse. This role allows Mixpanel to manage storage objects on your behalf.

To assign this role:

- Navigate to **Cloud Storage** in your Google Cloud Console and select the GCS bucket you have created or plan to reuse.
- Click the **PERMISSIONS** tab and select **GRANT ACCESS**.
- In the new principals field, add `[email protected]`, then select `Storage Object Admin` from the role dropdown menu.
- Confirm the assignment by clicking the **SAVE** button.

This ensures that the service account has the permissions needed to manage the data exported to your GCS bucket.
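
The same grant can be applied from the command line. A minimal `gsutil` sketch, where `<MIXPANEL_SERVICE_ACCOUNT>` stands for the service account email listed above and `<BUCKET_NAME>` is your GCS bucket:

```shell
# Grant Storage Object Admin on the bucket to the Mixpanel service account
gsutil iam ch \
  serviceAccount:<MIXPANEL_SERVICE_ACCOUNT>:roles/storage.objectAdmin \
  gs://<BUCKET_NAME>
```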

### Step 2: Provide Mixpanel with GCS Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide the following details so Mixpanel can accurately direct the data exports to your GCS bucket:

- **Bucket**: The GCS bucket to export Mixpanel data to
- **Region**: The GCS region for the bucket
