Commit 11db57e

Address-Docs-Feedback-2025-05-14 (#1872)
* add guide to create cohort via CSV
* remove reference to record_collect_fonts
* add time_event to Javascript SDK
* add clarification about not set values for UTM tracking and marketing attribution
* update pie chart support for comparisons
* add requirement to send events to merge IDs for Simplified API
* add deduplication mechanism to dev docs
* add future timestamp correction warning
* spelling
1 parent a3e6648 commit 11db57e

15 files changed: +147 -68 lines changed
Lines changed: 62 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,28 @@
1-
Event deduplication allows a project to send the same exact event while only recording that event once.
2-
Deduplication only occurs when a subset of the event data is exactly identical.
1+
Mixpanel provides an event deduplication mechanism to ensure that duplicate events do not skew your analytics. Deduplication is essential when events may be sent multiple times due to network retries, client-side batching, or integration with multiple data sources.
2+
3+
<br />
4+
5+
## How Deduplication Works
6+
7+
Mixpanel deduplicates events using a combination of four key event properties:
8+
9+
- Event Name (`event`)
10+
- Distinct ID (`distinct_id`)
11+
- Timestamp (`time`)
12+
- Insert ID (`$insert_id`)
13+
14+
If all four of these properties are identical across two or more events, Mixpanel considers them duplicates and will only show the most recent version of that event in your reports. This applies regardless of whether the events are sent via SDKs, APIs, or other integrations.
15+
16+
The `$insert_id` should be a randomly generated, unique value for each event to ensure proper deduplication. If `$insert_id` values are reused, events may be unintentionally deduplicated.
17+
18+
Only the four key event properties listed above are used for deduplication. Additional event properties are not considered by the deduplication mechanism. For example, if two events share the same Event Name, Distinct ID, Timestamp, and Insert ID but have different `$city` values, they are still considered duplicate events.
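
The rule above can be sketched as a comparison key. This is a minimal illustration rather than Mixpanel's implementation: the helper name `dedup_key` and the sample values are assumptions, and only the four compared fields come from the text.

```python
def dedup_key(event: dict) -> tuple:
    """Return the four fields the text says are compared for deduplication."""
    props = event["properties"]
    return (event["event"], props["distinct_id"], props["time"], props["$insert_id"])


first = {
    "event": "Item Purchased",
    "properties": {
        "distinct_id": "user123xyz",
        "time": 1601412131000,
        "$insert_id": "88B7hahbaschhhB66cbsg",
        "$city": "Paris",
    },
}
# Same four key fields, different $city (a property outside the key).
second = {**first, "properties": {**first["properties"], "$city": "Berlin"}}

are_duplicates = dedup_key(first) == dedup_key(second)
print(are_duplicates)
```

Because `$city` is not part of the key, the two events compare as duplicates even though their payloads differ.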
19+
20+
### Deduplication Example
21+
22+
Deduplication occurs when a subset of the event data (event name, distinct_id, timestamp, $insert_id) is identical. Other event properties are not considered.
323

424
**Required [Event Object](doc:data-model#anatomy-of-an-event) attributes**
25+
526
[block:parameters]
627
{
728
"data": {
@@ -13,6 +34,7 @@ Deduplication only occurs when a subset of the event data is exactly identical.
1334
"0-2": "A name for the event. For example, \"Signed up\", or \"Uploaded Photo\".",
1435
"1-0": "**properties**",
1536
"1-1": "<span style=\"font-family: courier\">Object</span></br><span style=\"color: red\">required</span>",
37+
"1-2": "",
1638
"2-0": "**properties.distinct_id**",
1739
"2-1": "<span style=\"font-family: courier\">String</span></br><span style=\"color: red\">required</span>",
1840
"2-2": "The value of `distinct_id` will be treated as a string, and used to uniquely identify a user associated with your event. If you provide a distinct_id property with your events, you can track a given user through funnels and distinguish unique users for retention analyses. You should always send the same distinct_id when an event is triggered by the same user.",
@@ -27,32 +49,56 @@ Deduplication only occurs when a subset of the event data is exactly identical.
2749
"5-2": "A unique UUID tied to exactly one occurrence of an event."
2850
},
2951
"cols": 3,
30-
"rows": 6
52+
"rows": 6,
53+
"align": [
54+
"left",
55+
"left",
56+
"left"
57+
]
3158
}
3259
[/block]
3360

34-
In other words, each event containing an $insert_id is checked for duplication after being minimized to the following shape:
61+
62+
In other words, each event containing an `$insert_id` is checked for duplication after being minimized to the following shape:
3563

3664
```json
3765
{
38-
"event": "Back to Back",
66+
"event": "Item Purchased",
3967
"properties": {
40-
"token": "project_token",
41-
"distinct_id": "[email protected]",
68+
"token": "my_project_token",
69+
"distinct_id": "user123xyz",
4270
"time": 1601412131000,
4371
"$insert_id": "88B7hahbaschhhB66cbsg"
4472
},
4573
}
4674
```
4775

48-
If this simplified object is an exact match to any other simplified event it is marked as a duplicate. Ingested events that have been marked as a duplicate will be deleted within 24 hours.
76+
If this minimized event object is an exact match to any other minimized event object, it is marked as a duplicate. Ingested events that have been marked as duplicates will be deduplicated.
4977

50-
If an event is sent to the Ingestion API without an `$insert_id` one will be generated for it. However, it will not qualify for the deduplication process.
78+
If an event is sent to the Ingestion API without an `$insert_id`, one will be generated for it. However, it will not qualify for the deduplication process.
5179

52-
[block:callout]
53-
{
54-
"type": "warning",
55-
"title": "Deduplication does not rewrite data",
56-
"body": "Using $insert_id is only used to prevent duplicate event data. It cannot be used to update, replace, or delete existing events."
57-
}
58-
[/block]
80+
## Deduplication Mechanisms
81+
82+
Mixpanel uses two main deduplication processes:
83+
84+
### Query-Time Deduplication
85+
86+
- When: Happens immediately when you query data in the Mixpanel UI.
87+
- How: If multiple events share the same event_name, distinct_id, timestamp, and $insert_id, only the most recent version of the event is shown in reports (based on the API ingestion time). This ensures that duplicate events do not affect your analytics in real time.
88+
- Scope: This deduplication is visible in the Mixpanel UI and reports, but not in raw data exports. Raw event exports contain all events as they were ingested, without any deduplication.
89+
90+
### Compaction-Time Deduplication
91+
92+
- When: Runs periodically in the backend, typically after a few hours and again after about 20 days, once data ingestion for a day is complete.
93+
- How: During compaction, Mixpanel scans for events with the same event name, distinct_id, and $insert_id (timestamp does not need to match exactly, just the same calendar day). The older event is deleted, and only the latest remains in storage.
94+
- Scope: This process helps reduce storage of duplicate events and may affect event counts if duplicates were present with different timestamps.
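
The compaction rule described above can be sketched as follows. This is an illustrative model only, not Mixpanel's backend: `ingested_at` is a hypothetical field standing in for ingestion order, and grouping by UTC calendar day is an assumption, since the text does not say which time zone defines the day.

```python
from datetime import datetime, timezone


def calendar_day(time_ms: int) -> str:
    """UTC calendar day for an epoch-milliseconds timestamp (UTC is an assumption)."""
    return datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc).date().isoformat()


def compact(events: list) -> list:
    """Keep only the latest-ingested event per (event, distinct_id, $insert_id, day)."""
    latest = {}
    for e in events:
        key = (e["event"], e["distinct_id"], e["$insert_id"], calendar_day(e["time"]))
        # "ingested_at" is hypothetical; it models "the older event is deleted".
        if key not in latest or e["ingested_at"] >= latest[key]["ingested_at"]:
            latest[key] = e
    return list(latest.values())


base = {"event": "Item Purchased", "distinct_id": "user123xyz", "$insert_id": "abc"}
events = [
    {**base, "time": 1601412131000, "ingested_at": 1},  # 2020-09-29 UTC
    {**base, "time": 1601412191000, "ingested_at": 2},  # same day, one minute later
    {**base, "time": 1601426531000, "ingested_at": 3},  # four hours later, 2020-09-30 UTC
]
compacted = compact(events)
print(len(compacted))
```

The first two events differ in timestamp but fall on the same day, so only the later-ingested one survives; the third falls on the next day and is kept.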
95+
96+
<br />
97+
98+
## Important Notes
99+
100+
**Raw Event Export** - Deduplication is not applied to raw data exports. If you export events via the API, you may see duplicates. It is recommended to apply the same deduplication logic (event name, distinct_id, timestamp, $insert_id) to your exported data.
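
A minimal sketch of applying that logic to exported data, assuming the export is newline-delimited JSON with each event shaped as `{"event": ..., "properties": {...}}`. The function name `dedupe_export` is hypothetical, and keeping the last occurrence in file order is an assumption standing in for "most recent version".

```python
import json


def dedupe_export(jsonl_text: str) -> list:
    """Collapse exported events that share event name, distinct_id, time and $insert_id."""
    kept = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        props = event["properties"]
        key = (
            event["event"],
            props.get("distinct_id"),
            props.get("time"),
            props.get("$insert_id"),
        )
        kept[key] = event  # a later line overwrites an earlier duplicate
    return list(kept.values())


rows = [
    {"event": "Item Purchased", "properties": {"distinct_id": "user123xyz", "time": 1601412131, "$insert_id": "a1", "$city": "Paris"}},
    {"event": "Item Purchased", "properties": {"distinct_id": "user123xyz", "time": 1601412131, "$insert_id": "a1", "$city": "Berlin"}},
    {"event": "Item Purchased", "properties": {"distinct_id": "user123xyz", "time": 1601412131, "$insert_id": "b2"}},
]
export_text = "\n".join(json.dumps(r) for r in rows)
unique_events = dedupe_export(export_text)
print(len(unique_events))
```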
101+
102+
**Insert ID Best Practice** - Always generate a unique `$insert_id` for each event. Reusing an `$insert_id` (e.g., setting it to the user’s distinct_id) can cause unintended deduplication and data loss.
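
One way to follow this practice is to generate the `$insert_id` once, when the event is created, from a random UUID. The sketch below uses only the Python standard library; `build_event` is a hypothetical helper, not part of any Mixpanel SDK.

```python
import time
import uuid


def build_event(name: str, distinct_id: str, properties: dict = None) -> dict:
    """Build an event payload with a fresh random $insert_id."""
    props = dict(properties or {})
    props["distinct_id"] = distinct_id
    props["time"] = int(time.time() * 1000)  # milliseconds since UTC epoch
    props["$insert_id"] = uuid.uuid4().hex   # random and unique per event
    return {"event": name, "properties": props}


a = build_event("Item Purchased", "user123xyz")
b = build_event("Item Purchased", "user123xyz")
print(a["properties"]["$insert_id"] != b["properties"]["$insert_id"])
```

If a send has to be retried, resending the same built payload (rather than calling the helper again) keeps the `$insert_id` stable, so the retry can be recognized as a duplicate.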
103+
104+
**Deduplication Timing** - Query-time deduplication is immediate. Compaction-time deduplication timing is not guaranteed and may take hours to days to complete.

openapi/src/ingestion.openapi.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -101,7 +101,7 @@ paths:
101101
time:
102102
type: integer
103103
title: time
104-
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch.
104+
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch. If the time value is set in the future, it will be overwritten with the current time at ingestion.
105105
distinct_id:
106106
type: string
107107
title: distinct_id
@@ -163,7 +163,7 @@ paths:
163163
time:
164164
type: integer
165165
title: time
166-
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch.
166+
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch. If the time value is set in the future, it will be overwritten with the current time at ingestion.
167167
distinct_id:
168168
type: string
169169
title: distinct_id
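
The future-timestamp behavior added to the `time` description can be modeled in a few lines. This is a sketch of the stated rule only: the function name is hypothetical, both values are assumed to be in milliseconds, and the actual ingestion pipeline may apply further rules not described here.

```python
def effective_time(event_time_ms: int, now_ms: int) -> int:
    """Return the timestamp that would be stored: future values become the ingestion time."""
    if event_time_ms > now_ms:
        return now_ms  # future timestamp is overwritten with the current time
    return event_time_ms


ingestion_now_ms = 1601412131000
past = effective_time(ingestion_now_ms - 60_000, ingestion_now_ms)
future = effective_time(ingestion_now_ms + 60_000, ingestion_now_ms)
print(past, future)
```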

pages/docs/data-structure/user-profiles.mdx

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,9 +159,11 @@ See here for more on how to [import](https://docs.mixpanel.com/docs/tracking-met
159159
Historical properties can be used anywhere that regular profile properties can be used.
160160

161161
For example, when you apply a breakdown by a historical plan-type property, the property value is picked based on the time of the event rather than the current property value.
162+
162163
![image](/historical_property_value.webp)
163164

164165
When you hover over a historical property, the context menu that pops up will show that the property was sourced from a history table, as well as the name of the source. This means that the value of the property used in charts can vary over time.
166+
165167
![image](/dropdown_historical_property.webp)
166168

167169
## Deleting Profiles

pages/docs/features/advanced.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -439,7 +439,7 @@ Comparisons are supported across all insights chart types. Depending on the exac
439439
| Insights Stacked Line | No | No | Yes |
440440
| Insights Bar | Yes | Yes | Yes |
441441
| Insights Stacked Bar | Yes | No | No |
442-
| Insights Pie | No | No | Yes |
442+
| Insights Pie | No | No | No |
443443
| Insights Metric | Yes | Yes | Yes |
444444
| Funnels Steps | Yes | Yes | No |
445445
| Funnel Trends | Yes | Yes | No |

pages/docs/features/attribution.mdx

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ import { Callout } from 'nextra/components'
1010

1111
Attribution helps teams attribute conversion credit to the touchpoints in a user journey, whether it's to the first or last touch (single-touch attribution models) or to multiple touchpoints using a multi-touch attribution model like U-shape or Linear.
1212

13-
Let’s consider an example user journey:
13+
Consider the following example user journey:
1414
1. A user sees an ad for a product on Facebook
1515
2. The user clicks on the ad and is taken to the product page on the company's website
1616
3. The user adds the product to their cart and begins the checkout process
@@ -64,7 +64,7 @@ If you use a Mixpanel js-sdk, we’ve updated our sdk to track utm parameters mo
6464
- **Attributed by property:** This is the property on a touchpoint event that we use for the attribution model. The canonical example is utm_source
6565
- **Lookback window:** The time window where a user's events with this attribution property are counted towards the calculation. The window ends when the conversion metric happens.
6666

67-
## Frequently Asked Questions
67+
## FAQ
6868

6969
### How does Mixpanel compute attribution under the hood?
7070

@@ -146,3 +146,7 @@ NOTE: You can apply a filter on an attribution property only after an attributio
146146
- Step 1: Turn on Attribution analysis by going to the breakdown section and choosing `Attributed by..` and property `XYZ`
147147
- Step 2 (a): Once attribution model has been applied, go to the filter section and choose the computed property `Attributed by XXX`. You can apply an attribution filter only on the property used in the attribution breakdown
148148
- Step 2 (b): Once attribution model has been applied, click on the chart bar and filter/exclude the segments as needed
149+
150+
### What does the "(not set)" attribution segment mean?
151+
152+
You may see a "(not set)" segment in your report when using the Attribution feature. This occurs when the attribution property is missing from all events being evaluated for the user.

pages/docs/session-replay/implement-session-replay/session-replay-web.mdx

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -100,7 +100,6 @@ mixpanel.init(
100100
| --- | --- | --- |
101101
| `record_block_class` | CSS class name or regular expression for elements which will be replaced with an empty element of the same dimensions, blocking all contents. | `new RegExp('^(mp-block\|fs-exclude\|amp-block\|rr-block\|ph-no-capture)$')` <br/> (common industry block classes) |
102102
| `record_block_selector` | CSS selector for elements which will be replaced with an empty element of the same dimensions, blocking all contents. | `"img, video"` |
103-
| `record_collect_fonts` | When true, Mixpanel will collect and store the fonts on your site to use in playback. | `false` |
104103
| `record_idle_timeout_ms` | Duration of inactivity in milliseconds before ending a contiguous replay. A new replay collection will start when active again. | `1800000`<br/>(30 minutes) |
105104
| `record_mask_text_class` | CSS class name or regular expression for elements that will have their text contents masked. | `new RegExp('^(mp-mask\|fs-mask\|amp-mask\|rr-mask\|ph-mask)$')` <br/> (common industry mask classes) |
106105
| `record_mask_text_selector` | CSS selector for elements that will have their text contents masked. | `"*"` |

pages/docs/tracking-best-practices/traffic-attribution.mdx

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,12 @@ Mixpanel's Javascript library will also track initial_utm_parameters as a profil
2323

2424
UTM parameters are by default persisted across events as [Super Properties](/docs/tracking-methods/sdks/javascript#setting-super-properties). To opt in to the recommended modern behavior most compatible with our [attribution models](/docs/features/attribution), use the SDK initialization option `{stop_utm_persistence: true}` to disable UTM param persistence (refer to our [Release Notes](https://github.com/mixpanel/mixpanel-js/releases/tag/v2.52.0) in GitHub).
2525

26+
#### Organic Traffic
27+
28+
If a user arrives at your landing page organically, no UTM tags will be parsed because the URL does not contain them. As a result, the UTM property will be absent from the events and will appear as "(not set)" when used as a breakdown in a report. You can interpret a "(not set)" value for any UTM property as indicating organic or direct traffic.
29+
30+
Learn more about falsy values [here](/docs/data-structure/property-reference/data-type#undefined-and-null).
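
The following sketch illustrates why organic traffic ends up in "(not set)". It is not the SDK's parsing code: the helper names and the list of UTM keys are illustrative assumptions, and only the standard library is used.

```python
from urllib.parse import parse_qs, urlparse

# Illustrative list of UTM parameters; the SDK's actual list may differ.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")


def utm_properties(url: str) -> dict:
    """Extract only the UTM parameters that are present in the URL."""
    query = parse_qs(urlparse(url).query)
    return {key: query[key][0] for key in UTM_KEYS if key in query}


def breakdown_label(props: dict, key: str) -> str:
    """Label a report segment; an absent property is shown as "(not set)"."""
    return props.get(key, "(not set)")


organic = utm_properties("https://example.com/landing")
paid = utm_properties("https://example.com/landing?utm_source=facebook&utm_medium=cpc")
print(breakdown_label(organic, "utm_source"), breakdown_label(paid, "utm_source"))
```

The organic URL yields no UTM properties at all, so the breakdown has nothing to read and falls back to "(not set)".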
31+
2632
### Initial Referrer and Initial Referring Domain Properties
2733

2834
Mixpanel's Javascript library will track Initial Referrer and Initial Referring Domain and append them as a property to any event that a user completes. These properties are stored in the Mixpanel cookie the first time a user comes to your site and will not change on future site visits as long as the cookie is not cleared.
@@ -33,7 +39,7 @@ Having this information allows you to build reports to see how users from differ
3339

3440
#### $direct
3541

36-
An initial referrer is equal to $direct when a user first lands on a site without being referred by another website. The user may have typed the website address directly, clicked a bookmark, clicked a link from an email, or might have security settings in their browser that prevent referrer data from being passed.
42+
An initial referrer is equal to `$direct` when a user first lands on a site without being referred by another website. The user may have typed the website address directly, clicked a bookmark, clicked a link from an email, or might have security settings in their browser that prevent referrer data from being passed.
3743

3844
## Mobile Attribution
3945

pages/docs/tracking-methods/id-management/identifying-users-simplified.mdx

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -54,11 +54,12 @@ If an event contains a `$user_id`, the value of the `$user_id` will be set as th
5454

5555
## Client-side Identity Management
5656

57-
If using our Web/Mobile SDKs or a CDP like Segment or Rudderstack, there are only 2 steps:
58-
1. Call `.identify(<user_id>)` when a user signs up or logs in. Pass in the user's known identifier (eg: their ID from your database).
59-
2. Call `.reset()` when a user logs out.
57+
If using our Web/Mobile SDKs or a CDP like Segment or Rudderstack, there are only 2 steps to identity management:
58+
1. Call `.identify(<user_id>)` when a user signs up or logs in, passing in the user's known identifier (eg: their ID from your database).
59+
2. Send at least one event after the `.identify()` call. This is necessary to get the `$user_id` and `$device_id` to merge. Learn more about [the merge mechanism above](/docs/tracking-methods/id-management/identifying-users-simplified#mechanism).
60+
3. Call `.reset()` when a user logs out.
6061

61-
- Any events prior to calling `.identify` are considered anonymous events. Mixpanel's SDKs will generate a `$device_id` to associate these events to the same anonymous user. By calling `.identify(<user_id>)` when a user signs up or logs in, you're telling Mixpanel that `$device_id` belongs to a known user with ID `user_id`.
62+
- Any events prior to calling `.identify()` are considered anonymous events. Mixpanel's SDKs will generate a `$device_id` to associate these events to the same anonymous user. By calling `.identify(<user_id>)` when a user signs up or logs in, you're telling Mixpanel that `$device_id` belongs to a known user with ID `user_id`.
6263

6364
- Under the hood, Mixpanel will stitch the event streams of those users together. This works even if a user has multiple anonymous sessions (eg: on desktop and mobile). As long as you always call `.identify` when the user logs in, all of that activity will be stitched together.
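
The three steps can be illustrated with a toy model. This is not the Mixpanel SDK: `ToyClient` and `merge_pairs` are hypothetical, and the model simply assumes that an ID pair can only be merged once some event carries both `$device_id` and `$user_id`, which is why an event must follow `.identify()`.

```python
import uuid


class ToyClient:
    """Toy stand-in for a client SDK; it only records the events it would send."""

    def __init__(self):
        self.device_id = uuid.uuid4().hex
        self.user_id = None
        self.sent = []

    def identify(self, user_id: str) -> None:
        self.user_id = user_id  # in this model, nothing is sent yet

    def track(self, name: str) -> None:
        props = {"$device_id": self.device_id}
        if self.user_id is not None:
            props["$user_id"] = self.user_id
        self.sent.append({"event": name, "properties": props})

    def reset(self) -> None:
        self.device_id = uuid.uuid4().hex  # fresh anonymous identity
        self.user_id = None


def merge_pairs(events: list) -> set:
    """Collect (device_id, user_id) pairs from events that carry both IDs."""
    pairs = set()
    for e in events:
        props = e["properties"]
        if "$device_id" in props and "$user_id" in props:
            pairs.add((props["$device_id"], props["$user_id"]))
    return pairs


client = ToyClient()
anon_device = client.device_id
client.track("Viewed Page")                      # anonymous event
client.identify("user-42")                       # step 1
pairs_before_event = merge_pairs(client.sent)    # still empty
client.track("Logged In")                        # step 2: first event after identify
pairs_after_event = merge_pairs(client.sent)
client.reset()                                   # step 3: user logs out
```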
6465
