287 | 287 | "metadata": {},
288 | 288 | "source": [
289 | 289 | "## Caching\n",
290 |     | - "As illustrated above, reasoning models produce both reasoning tokens and completion tokens that are treated differently in the API today. This also has implications for cache utilization and latency. To illustrate the point, we include this helpful sketch.\n",
    | 290 | + "\n",
    | 291 | + "As shown above, reasoning models generate both reasoning tokens and completion tokens, which the API handles differently. This distinction affects how caching works and impacts both performance and latency. The following diagram illustrates these concepts:\n",
291 | 292 | "\n",
292 | 293 | ""
293 | 294 | ]
296 | 297 | "cell_type": "markdown",
297 | 298 | "metadata": {},
298 | 299 | "source": [
299 |     | - "In turn 2, reasoning items from turn 1 are ignored and stripped, since the model doesn't reuse reasoning items from previous turns. This makes it impossible to get a full cache hit on the fourth API call in the diagram above, as the prompt now omits those reasoning items. However, including them does no harm—the API will automatically remove any reasoning items that aren't relevant for the current turn. Note that caching only matters for prompts longer than 1024 tokens. In our tests, switching from Completions to the Responses API increased cache utilization from 40% to 80%. Better cache utilization means better economics, since cached tokens are billed much less: for `o4-mini`, cached input tokens are 75% cheaper than uncached ones. Latency also improves."
    | 300 | + "In turn 2, any reasoning items from turn 1 are ignored and removed, since the model does not reuse reasoning items from previous turns. As a result, the fourth API call in the diagram cannot achieve a full cache hit, because those reasoning items are missing from the prompt. However, including them is harmless—the API will simply discard any reasoning items that aren’t relevant for the current turn. Keep in mind that caching only impacts prompts longer than 1024 tokens. In our tests, switching from the Completions API to the Responses API boosted cache utilization from 40% to 80%. Higher cache utilization leads to lower costs (for example, cached input tokens for `o4-mini` are 75% cheaper than uncached ones) and improved latency."
300 | 301 | ]
301 | 302 | },
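
To make the caching behavior concrete, here is a minimal sketch of the two-turn flow described above, using the official `openai` Python SDK. The prompts are illustrative rather than taken from the notebook; the key point is that the full output of turn 1, reasoning items included, is passed back in turn 2.

```python
from openai import OpenAI

client = OpenAI()

context = [{"role": "user", "content": "Summarize the plot of Hamlet."}]

# Turn 1: the output contains both reasoning items and the assistant message.
resp1 = client.responses.create(model="o4-mini", input=context)

# Pass everything from turn 1 back in. Reasoning items that are no longer
# relevant are stripped server-side, so including them is harmless.
context += resp1.output
context.append({"role": "user", "content": "Now compare it to Macbeth."})

resp2 = client.responses.create(model="o4-mini", input=context)

# cached_tokens shows how much of the prompt was served from the cache
# (only prompts longer than 1024 tokens are eligible for caching).
print(resp2.usage.input_tokens_details.cached_tokens)
```
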
302 | 303 | {
305 | 306 | "source": [
306 | 307 | "## Encrypted Reasoning Items\n",
307 | 308 | "\n",
308 |     | - "For organizations that can't use the Responses API statefully due to compliance or data retention requirements (such as [Zero Data Retention](https://openai.com/enterprise-privacy/)), we've introduced [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items). This lets you get all the benefits of reasoning items while keeping your workflow stateless.\n",
    | 309 | + "Some organizations—such as those with [Zero Data Retention (ZDR)](https://openai.com/enterprise-privacy/) requirements—cannot use the Responses API in a stateful way due to compliance or data retention policies. To support these cases, OpenAI offers [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items), allowing you to keep your workflow stateless while still benefiting from reasoning items.\n",
309 | 310 | "\n",
310 |     | - "To use this, simply add `[\"reasoning.encrypted_content\"]` to the `include` field. You'll receive an encrypted version of the reasoning tokens, which you can pass back to the API just as you would with regular reasoning items.\n",
    | 311 | + "To use encrypted reasoning items:\n",
    | 312 | + "- Add `[\"reasoning.encrypted_content\"]` to the `include` field in your API call.\n",
    | 313 | + "- The API will return an encrypted version of the reasoning tokens, which you can pass back in future requests just like regular reasoning items.\n",
311 | 314 | "\n",
312 |     | - "For Zero Data Retention (ZDR) organizations, OpenAI enforces `store=false` at the API level. When a request arrives, the API checks for any `encrypted_content` in the payload. If present, it's decrypted in-memory using keys only OpenAI can access. This decrypted reasoning (chain-of-thought) is never written to disk and is used only for generating the next response. Any new reasoning tokens are immediately encrypted and returned to you. All transient data—including decrypted inputs and model outputs—is securely discarded after the response, with no intermediate state persisted, ensuring full ZDR compliance.\n",
    | 315 | + "For ZDR organizations, OpenAI enforces `store=false` automatically. When a request includes `encrypted_content`, it is decrypted in-memory (never written to disk), used for generating the next response, and then securely discarded. Any new reasoning tokens are immediately encrypted and returned to you, ensuring no intermediate state is ever persisted.\n",
313 | 316 | "\n",
314 |     | - "Here’s a quick update to the earlier code snippet to show how this works:"
    | 317 | + "Here’s a quick code update to show how this works:"
315 | 318 | ]
316 | 319 | },
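
A rough sketch of the stateless pattern the cell describes, again with the `openai` Python SDK; the prompt is illustrative, and this is not the notebook's exact snippet:

```python
from openai import OpenAI

client = OpenAI()

context = [{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}]

response = client.responses.create(
    model="o4-mini",
    input=context,
    store=False,  # stateless: nothing is persisted server-side
    include=["reasoning.encrypted_content"],  # return reasoning encrypted
)

# The encrypted reasoning items come back in response.output and can be
# passed into the next request exactly like regular reasoning items.
context += response.output
```
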
317 | 320 | {
451 | 454 | "cell_type": "markdown",
452 | 455 | "metadata": {},
453 | 456 | "source": [
454 |     | - "Reasoning summary text enables you to design user experiences where users can peek into the model's thought process. For example, in conversations involving multiple function calls, users can see not only which function calls are made, but also the reasoning behind each tool call—without having to wait for the final assistant message. This provides greater transparency and interactivity in your application's UX."
    | 457 | + "Reasoning summary text lets you give users a window into the model’s thought process. For example, during conversations with multiple function calls, users can see both which functions were called and the reasoning behind each call—without waiting for the final assistant message. This adds transparency and interactivity to your application’s user experience."
455 | 458 | ]
456 | 459 | },
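
One way to build such a UX is to stream summary text ahead of the final answer. Here is a sketch, assuming the `summary` option of the `reasoning` parameter and the summary-delta streaming events documented for the Responses API; the prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="o4-mini",
    input=[{"role": "user", "content": "What's the weather in Paris and Rome?"}],
    reasoning={"summary": "auto"},  # request reasoning summaries
    stream=True,
)

for event in stream:
    # Summary deltas arrive before the final assistant message, so the UI
    # can show the model's reasoning while tool calls are still running.
    if event.type == "response.reasoning_summary_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```
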
457 | 460 | {