
Commit 170eadb

Json pipelines docs (#1165)
* Json pipelines docs
* address comment
* update links
* delete metadata
* add meta
* update s3 doc
* s3 page
* itmes
* update bigquery page
* update redshift docs
* image
* add Region to glue
* update creating database
* add file
* update privilege
* update
* update redshift
* nit
* snowflake docs
* remove redshift cluster info
* address comments
* restructre
* rename path
* add instruction to create pipelines
* add privilege in redshift
* update events path
* update links
* update events path
* fix links
* fix syntax
* add snowflake
* change callout syntax
* fix
* fix
* test redirects
* mdx
* fix redirects
* test link
* nit
1 parent fa270ab commit 170eadb

29 files changed: +1198 -173 lines changed

pages/docs/data-pipelines/_meta.json

Lines changed: 4 additions & 3 deletions

@@ -1,5 +1,6 @@
   {
-    "overview": "Overview",
-    "schematized-export-pipeline": "Schematized Export Pipeline",
-    "integrations": "Integrations"
+    "overview": "Overview",
+    "json-pipelines": "Data Pipelines",
+    "integrations": "Integrations",
+    "old-pipelines": "Older Version"
   }

pages/docs/data-pipelines/integrations.mdx

Lines changed: 0 additions & 28 deletions
This file was deleted.
Lines changed: 6 additions & 8 deletions

@@ -1,10 +1,8 @@
   {
-    "raw-aws-pipeline": "Raw AWS Pipeline",
-    "raw-azure-pipeline": "Raw Azure Pipeline",
-    "raw-gcs-pipeline": "Raw GCS Pipeline",
-    "schematized-bigquery-pipeline": "Schematized BigQuery Pipeline",
-    "schematized-aws-pipeline": "Schematized AWS Pipeline",
-    "schematized-azure-pipeline": "Schematized Azure Pipeline",
-    "schematized-gcs-pipeline": "Schematized GCS Pipeline",
-    "schematized-snowflake-pipeline": "Schematized Snowflake Pipeline"
+    "aws-s3": "AWS S3",
+    "azure-blob-storage": "Azure Blob Storage",
+    "bigquery": "BigQuery",
+    "gcp-gcs": "Google Cloud Storage",
+    "redshift-spectrum": "Redshift Spectrum",
+    "snowflake": "Snowflake"
   }
Lines changed: 127 additions & 0 deletions
# AWS S3

Mixpanel allows you to export events and people data into AWS S3 through [JSON Pipelines](/docs/data-pipelines/overview).

## Setting S3 Permissions

Before configuring permissions, note that you must first create the AWS S3 bucket you want to export to.

Mixpanel supports various configurations to securely manage your data on AWS S3. For resource access, Mixpanel uses AWS cross-account roles. This section details the permissions Mixpanel requires based on your S3 bucket configuration.

### Step 1: Create Data Modification Policy

To export data from Mixpanel to AWS S3, grant the following data modification permissions. Use the policy below, replacing `<BUCKET_NAME>` with the name of your bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MixpanelS3AccessStatement",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::<BUCKET_NAME>", "arn:aws:s3:::<BUCKET_NAME>/*"]
    }
  ]
}
```

### Step 2: Server-Side Encryption (optional)

Mixpanel transfers data to your S3 bucket over a TLS-encrypted connection. To secure your data at rest in S3, enable [Server-Side Encryption (SSE)](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html), which offers two options: **Encryption with Amazon S3-Managed Keys (SSE-S3)** and **Encryption with AWS KMS-Managed Keys (SSE-KMS)**.

#### Encryption with Amazon S3-Managed Keys (SSE-S3)

This option encrypts your data at rest using the AES-256 algorithm, with keys managed by S3. To enable it, select `AES` from the **Encryption** dropdown menu when creating the pipeline.

#### Encryption with AWS KMS-Managed Keys (SSE-KMS)

For encryption with AWS KMS, you can use either the default `aws/s3` key or your own custom key.

- Using the Default Key

  Simply select `KMS` from the **Encryption** dropdown menu and leave the `KMS Key ID` field empty when creating your pipeline.

- Using a Custom Key

  1. Select `KMS` from the **Encryption** dropdown menu and enter your custom key's ARN in the `KMS Key ID` field.

  2. Create an IAM policy allowing Mixpanel to use your KMS key, as shown in the JSON snippet below. Replace `<KEY_ARN>` with your key's ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MixpanelKmsStatement",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey",
        "kms:ReEncryptTo",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:DescribeKey",
        "kms:ReEncryptFrom"
      ],
      "Resource": "<KEY_ARN>"
    }
  ]
}
```

### Step 3: Create Access Role

After establishing the necessary policies, create a cross-account IAM role and attach the policies you've created:

- Go to the **IAM** service in the AWS console.
- Select **Roles** in the sidebar and click **Create role**.
- On the trusted entity page, choose **AWS Account**, then click **Another AWS account**, enter `485438090326` for the **Account ID**, and click **Next**.
- On the permissions page, locate and attach the policies you created in the previous steps (data modification and, if applicable, KMS).
- On the review page, provide a name and description for this role and click **Create role**.

To ensure secure operations, limit the trust relationship to the Mixpanel export user:

- Return to the **IAM** service, select **Roles**, and locate the role you just created.
- In the **Trust relationships** tab, click **Edit trust policy**.
- Update the trust relationship with the following JSON, replacing `<MIXPANEL_PROJECT_TOKEN>` with your Mixpanel project token.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::485438090326:user/mixpanel-export"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<MIXPANEL_PROJECT_TOKEN>"
        }
      }
    }
  ]
}
```

- Click **Update policy** and save.

This setup uses an external ID to prevent [the confused deputy problem](https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html), enhancing the security of cross-account access, as Mixpanel interacts with AWS using your project token.
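
The same setup can also be scripted with the AWS CLI. This is a minimal sketch, assuming the data modification policy and trust policy shown above are saved locally as `mixpanel-s3-policy.json` and `mixpanel-trust-policy.json` (hypothetical file names), and that the policy and role names are placeholders you can rename:

```shell
# Create the data modification policy from the JSON document in Step 1
aws iam create-policy \
  --policy-name mixpanel-s3-access \
  --policy-document file://mixpanel-s3-policy.json

# Create the cross-account role using the trust policy above;
# the external ID inside it should be your Mixpanel project token
aws iam create-role \
  --role-name mixpanel-export-role \
  --assume-role-policy-document file://mixpanel-trust-policy.json

# Attach the data modification policy to the role
# (replace <ACCOUNT_ID> with your AWS account ID)
aws iam attach-role-policy \
  --role-name mixpanel-export-role \
  --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/mixpanel-s3-access
```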

### Step 4: Provide Mixpanel with S3 Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. Provide the following details so that Mixpanel can accurately direct the data exports to your S3 bucket:

- **Bucket**: The S3 bucket where Mixpanel data should be exported.
- **Region**: The AWS region where your S3 bucket is located.
- **Role**: The AWS role ARN that Mixpanel should assume when writing to your S3 bucket, e.g., `arn:aws:iam::<ACCOUNT_ID>:role/example-s3-role`.
- **Encryption (optional)**: The type of at-rest encryption used by the S3 bucket.
- **KMS Key ID (optional)**: If using KMS encryption, the custom key ID that you wish to use.
Lines changed: 37 additions & 0 deletions
# Azure Blob Storage

Mixpanel allows you to export events and people data directly into an Azure Blob Storage instance through [JSON Pipelines](/docs/data-pipelines/overview).

## Setting Blob Storage Permissions

To enable Mixpanel to write data to your Azure Blob Storage, specific permissions need to be set up, because Azure authentication mechanisms do not support cross-account access. You will need to provide Mixpanel with Azure credentials linked to your Blob Storage container.

### Step 1: Create a Service Principal

Start by creating a **_Service Principal_** in your Azure Active Directory. This can be done using the Azure CLI with the command below (output shown with `"redacted"` values). The command generates credentials in JSON format; handle the output securely, as it contains sensitive information.

```shell
$ az ad sp create-for-rbac --sdk-auth
{
  "clientId": "redacted",
  "clientSecret": "redacted",
  "subscriptionId": "redacted",
  "tenantId": "redacted",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}
```

### Step 2: Assign Role to Service Principal

Next, navigate to the Blob Storage container you wish to use and assign the `Storage Blob Data Contributor` role to the newly created Service Principal.
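
The assignment can also be made from the Azure CLI. A minimal sketch, assuming `<CLIENT_ID>` is the `clientId` returned in Step 1 and the subscription, resource group, storage account, and container names are placeholders for your own resources:

```shell
# Grant the Service Principal "Storage Blob Data Contributor" on the target container
az role assignment create \
  --assignee <CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>/blobServices/default/containers/<CONTAINER_NAME>"
```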

### Step 3: Provide Mixpanel with Access Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide specific details to enable authentication and data export to Azure Blob Storage. For authentication, supply the `Client Id`, `Client Secret`, and `Tenant Id`. These credentials allow Mixpanel to operate as the Service Principal and authenticate securely without exposing broader Azure resources.

Additionally, to define the export destination, provide the `Storage Account` and `Container Name`. These details identify the exact location within Azure where your data will be exported.
Lines changed: 116 additions & 0 deletions
# BigQuery

This guide describes how Mixpanel exports your data into a customer-managed [Google BigQuery](https://cloud.google.com/bigquery/) dataset.

## Design

![image](/230698685-c02cb9a1-d66f-42a7-8063-8e78b79e7b1f.png)

For events data, we create a single table called `mp_master_event` and store all external properties inside the `properties` column in JSON type. You can extract properties using JSON functions; see [Query Data](#query-data) for more details.

For user profiles and identity mappings, we create new tables `mp_people_data_*` and `mp_identity_mappings_data_*` with a random suffix every time, and then update the views `mp_people_data_view` and `mp_identity_mappings_data_view` to point at the latest table. Always use the views instead of the underlying tables: old tables are not deleted immediately, so querying them directly may return outdated data.
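
As a quick illustration (a sketch using the `bq` command-line tool; the project and dataset names are placeholders), you can query the view directly rather than any suffixed table:

```shell
# Query the people view; the underlying mp_people_data_* tables may be stale
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS people_count FROM `<your-gcp-project>.<your-dataset>.mp_people_data_view`'
```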

Export logs are maintained in the `mp_nessie_export_log` table within BigQuery. This table provides details such as export times, date ranges (from date and to date), and the number of rows exported, allowing you to monitor and audit the export process.

> **Important:** Please do not modify the schema of tables generated by Mixpanel. Altering the table schema can cause the pipeline to fail to export due to schema mismatches.

## Setting BigQuery Permissions

Follow these steps to share permissions with Mixpanel and create JSON pipelines.

### Step 1: Create a Dataset

Create a dataset in BigQuery to store the Mixpanel data.

![image](/230698727-1216833e-8321-46de-a388-8b554a00938c.png)

### Step 2: Grant Permissions to Mixpanel

> **Note:** If your organization uses a [domain restriction constraint](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains), you will have to update the policy to allow the Mixpanel domain `mixpanel.com` and Google Workspace customer ID `C00m5wrjz`.

Mixpanel requires two permissions to manage the dataset:

**BigQuery Job User**

- Navigate to **IAM & Admin** in your Google Cloud Console.
- Click **+ ADD** to add principals.
- Add the new principal `[email protected]` and set the role to `BigQuery Job User`.
- Click the **Save** button.

![image](/230698732-4dadbccf-1eeb-4e64-a6c7-8926eb49e5cc.png)

**BigQuery Data Owner**

- Go to **BigQuery** in your Google Cloud Console.
- Open the dataset intended for Mixpanel exports.
- Click **Sharing** and select **Permissions** from the dropdown.
- In the Data Permissions window, click **Add Principal**.
- Add the new principal `[email protected]`, set the role to `BigQuery Data Owner`, and save.

![image](/230698735-972aedb5-1352-4ebc-82c4-ef075679779b.png)
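
If you prefer to script the project-level grant, here is a minimal `gcloud` sketch; `<PROJECT_ID>` and `<DATASET_NAME>` are placeholders, and `<MIXPANEL_EXPORT_ACCOUNT>` stands for the Mixpanel principal email shown in the steps above. The dataset-level Data Owner grant can still be applied through the Sharing dialog described above.

```shell
# Grant project-level BigQuery Job User to the Mixpanel export principal
# (use the principal email from the console steps above; adjust the member
# prefix, e.g. serviceAccount: or user:, to match the account type)
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<MIXPANEL_EXPORT_ACCOUNT>" \
  --role="roles/bigquery.jobUser"

# Verify the dataset's access entries after sharing it in the console
bq show --format=prettyjson <PROJECT_ID>:<DATASET_NAME>
```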

### Step 3: Provide Necessary Details for Pipeline Creation

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide the following details to enable authentication and data export to BigQuery:

- **GCP project ID**: The project ID where the BigQuery dataset is present
- **Dataset name**: The dataset created on the GCP project to which Mixpanel needs to export data
- **GCP region**: The region used for BigQuery

## Partitioning

Data in the events table `mp_master_event` is partitioned based on the [`_PARTITIONTIME` pseudo column](https://cloud.google.com/bigquery/docs/querying-partitioned-tables#ingestion-time_partitioned_table_pseudo_columns), in the project timezone.

> **Note:** `TIMEPARTITIONING` should not be updated on the table; doing so will cause your export jobs to fail. Create a new table or view from this table for custom partitioning.

## Query Data

This section provides examples of how to query data exported to BigQuery. Refer to the [BigQuery docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_value) for more details about using JSON functions to query properties.

### Get the Number of Events Each Day

To verify the completeness of the export process, use the following SQL query to count events per day:

```sql
SELECT
  _PARTITIONTIME AS pt,
  COUNT(*)
FROM
  `<your gcp project>.<your dataset>.mp_master_event`
WHERE
  DATE(_PARTITIONTIME) <= "2024-05-31"
  AND DATE(_PARTITIONTIME) >= "2024-05-01"
GROUP BY
  pt
```

### Query Identity Mappings

When querying the identity mappings table, prefer the `resolved_distinct_id` over the non-resolved `distinct_id` whenever it is available. If a `resolved_distinct_id` is not available, fall back to the `distinct_id` from the people or events table.

Below is an example query that uses the identity mappings table. It counts the number of events for each unique user in San Francisco within a specific date range.

```sql
SELECT
  CASE
    WHEN mappings.resolved_distinct_id IS NOT NULL THEN mappings.resolved_distinct_id
    WHEN mappings.resolved_distinct_id IS NULL THEN events.distinct_id
  END AS resolved_distinct_id,
  COUNT(*) AS count
FROM
  `<your gcp project>.<your dataset>.mp_master_event` events
INNER JOIN
  `<your gcp project>.<your dataset>.mp_identity_mappings_data_view` mappings
ON
  events.distinct_id = mappings.distinct_id
  AND JSON_VALUE(properties, '$."$city"') = "San Francisco"
  AND DATE(events._PARTITIONTIME) <= "2024-05-31"
  AND DATE(events._PARTITIONTIME) >= "2024-05-01"
GROUP BY
  resolved_distinct_id
LIMIT
  100
```

This query demonstrates how to use conditional logic and JSON functions in BigQuery to analyze user behavior by geographic location. Additional filters on event properties can be added to refine the analysis and gain more detailed insight into specific user actions or behaviors.
Lines changed: 27 additions & 0 deletions
# Google Cloud Storage

Mixpanel supports exporting events and people data directly to Google Cloud Storage (GCS) via [JSON Pipelines](/docs/data-pipelines/overview).

## Setting GCS Permissions

To facilitate data export to Google Cloud Storage, permissions must be configured to allow Mixpanel access to your GCS bucket.

### Step 1: Assign Roles to Service Account on Bucket

You must grant the `Storage Object Admin` role to the service account `[email protected]` for the bucket you are creating or intend to reuse. This role allows Mixpanel to manage storage objects on your behalf.

To assign this role:

- Navigate to **Cloud Storage** in your Google Cloud Console and select the GCS bucket you have created or plan to reuse.
- Click the **PERMISSIONS** tab and select **GRANT ACCESS**.
- In the new principals field, add `[email protected]`, then select `Storage Object Admin` from the role dropdown menu.
- Confirm the assignment by clicking the **SAVE** button.

This ensures that the service account has the permissions needed to manage the data exported to your GCS bucket.
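
The same grant can be applied from the command line. A minimal `gsutil` sketch, where `<MIXPANEL_SERVICE_ACCOUNT>` stands for the service account email listed above and `<BUCKET_NAME>` is your GCS bucket:

```shell
# Grant Storage Object Admin on the bucket to the Mixpanel service account
gsutil iam ch \
  serviceAccount:<MIXPANEL_SERVICE_ACCOUNT>:roles/storage.objectAdmin \
  gs://<BUCKET_NAME>
```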

### Step 2: Provide Mixpanel with GCS Details

Refer to [Step 2: Creating the Pipeline](/docs/data-pipelines/overview/#step-2-creating-the-pipeline) to create a data pipeline via the UI. You need to provide the following details so Mixpanel can accurately direct the data exports to your GCS bucket:

- **Bucket**: The GCS bucket to export Mixpanel data to
- **Region**: The GCS region for the bucket
