Commit 70cc67c

Refine and add training sample (#4)
* refine
* train
1 parent 49ec838 commit 70cc67c

35 files changed, +5993 -87 lines changed

.devcontainer/devcontainer.json

+1-1
@@ -1,5 +1,5 @@
 {
-  "name": "Azure Content Understanding Demo",
+  "name": "Azure AI Content Understanding Demo",
   "image": "mcr.microsoft.com/devcontainers/python:3.11",
   "customizations": {
     "vscode": {

.gitignore

+3
@@ -160,3 +160,6 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# VSCode
+.vscode

CONTRIBUTING.md

+1-1
@@ -1,4 +1,4 @@
-# Contributing to [project-title]
+# Contributing to Azure AI Content Understanding Samples (Python)
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
 Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us

README.md

+17-22
@@ -1,47 +1,42 @@
 # Azure AI Content Understanding Samples (Python)
 
-Welcome! Content Understanding is a solution that analyzes and comprehends various media content, such as **document, images, audio, and video**, transforming it into structured, organized, and searchable data.
-
-- The contents of this repository default to the latest preview version: **(2024-12-01-preview)**.
+Welcome! Content Understanding is a solution that analyzes and comprehends various media content, such as **documents, images, audio, and video**, transforming it into structured, organized, and searchable data.
 
+- The samples in this repository default to the latest preview API version: **(2024-12-01-preview)**.
 
 ## Features
 
-Azure AI Content Understanding is a new Generative AI based [Azure AI service](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview), designed to process/ingest content of any types (document, image, audio, and video) into an user-defined output format. Content Understanding offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.
+Azure AI Content Understanding is a new Generative AI-based [Azure AI service](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview), designed to process/ingest content of any type (documents, images, audio, and video) into a user-defined output format. Content Understanding offers a streamlined process to reason over large amounts of unstructured data, accelerating time-to-value by generating an output that can be integrated into automation and analytical workflows.
 
+## Samples
 
-## Sample List
 | File | Description |
 | --- | --- |
-| [field_extraction.ipynb](notebooks/field_extraction.ipynb) | Extract customized fields defined in analyzer templates |
-| [content_extraction.ipynb](notebooks/content_extraction.ipynb) | Extract structrued content understanding result from your input files |
-| [analyzer_training.ipynb](notebooks/analyzer_training.ipynb) | Provide training data to improve quality of your analyzer |
-
-
+| [field_extraction.ipynb](notebooks/field_extraction.ipynb) | Extract custom fields with sample analyzer templates |
+| [content_extraction.ipynb](notebooks/content_extraction.ipynb) | Extract structured content from your input files |
+| [analyzer_training.ipynb](notebooks/analyzer_training.ipynb) | Provide training data to improve the quality of your analyzer |
 
 ## Prerequisites
 
-1. To get started, you need an active [Azure account](https://azure.microsoft.com/free/cognitive-services/). If you don't have one, you can [create a free subscription](https://azure.microsoft.com/free/).
-1. Once you have Azure subscription, create an [Content Understanding Service and Get endpoint and keys](Create_Content_Understanding_Service.md).
-
-
+To use Content Understanding, you need an [Azure AI Services resource](docs/create_azure_ai_service.md).
 
-## Getting Started with GitHub Codespaces
+## Getting started with GitHub Codespaces
 
-You can run this repo virtually by using GitHub Codespaces, which will open a web-based VS Code in your browser:
+You can run this repo virtually by using GitHub Codespaces, which will open a web-based VS Code in your browser.
 
 [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://github.com/codespaces/new?skip_quickstart=true&machine=basicLinux32gb&repo=899687170&ref=main&geo=UsEast&devcontainer_path=.devcontainer%2Fdevcontainer.json)
 
+### Configure Azure AI service resource
 
-### Configure Setup Azure Content Understanding Resource
 1. Copy `notebooks/.env.sample` to `notebooks/.env`
-1. Fill **AZURE_CU_ENDPOINT** and ** AZURE_CU_API_KEY** with your resource
+2. Fill **AZURE_AI_ENDPOINT** and **AZURE_AI_API_KEY** with the endpoint and key values from your Azure portal Azure AI Services instance.
 
+### Open a Jupyter notebook and follow the step-by-step guidance
 
-### Open jupyter notebooks and follow the step-by-step guidance
-Go into `notebooks` and find the samples that interesting to you. Codespaces prepare essential environment fo you, simply execute steps.
+Navigate to the `notebooks` directory and select the sample notebook you are interested in. Since Codespaces is pre-configured with the necessary environment, you can directly execute each step in the notebook.
 
+### Notes
 
-### Note
+* **Trademarks** - This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party’s policies.
 
->Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
+* **Data Collection** - The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.
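
Note on the configuration step in the README diff above: the two settings it names, `AZURE_AI_ENDPOINT` and `AZURE_AI_API_KEY`, can be checked before a notebook is run. The following is a minimal sketch, assuming `notebooks/.env` holds plain `KEY=VALUE` lines; the helper names `parse_env` and `missing_settings` and the placeholder endpoint are hypothetical and are not part of this commit.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("AZURE_AI_ENDPOINT", "AZURE_AI_API_KEY")):
    """Return the required setting names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Placeholder content; a real notebooks/.env holds the values from the Azure portal.
sample = """
# copied from notebooks/.env.sample
AZURE_AI_ENDPOINT=https://example-resource.example.com/
AZURE_AI_API_KEY=
"""
settings = parse_env(sample)
print(missing_settings(settings))  # names that still need a value
```

In practice the text would come from reading `notebooks/.env`; a library such as python-dotenv may be a better fit if the file uses quoting or variable expansion.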

analyzer_templates/README.md

Whitespace-only changes.

analyzer_templates/sample_call_transcript_analyzer.json → analyzer_templates/call_transcript.json

-1
@@ -1,5 +1,4 @@
 {
-  "analyzerId": "sample_chart_analyzer",
   "description": "Sample call transcript analyzer",
   "scenario": "callCenter",
   "config": {

analyzer_templates/image_chart.json

+94
@@ -0,0 +1,94 @@
+{
+  "name": "Chart and diagram understanding",
+  "description": "Extract detailed structured information from charts and diagrams.",
+  "scenario": "image",
+  "config": {
+    "returnDetails": false
+  },
+  "fieldSchema": {
+    "name": "ChartAndDiagram",
+    "description": "Structured information from charts and diagrams.",
+    "fields": {
+      "Title": {
+        "type": "string",
+        "method": "generate",
+        "description": "Verbatim title of the chart."
+      },
+      "ChartType": {
+        "type": "string",
+        "method": "classify",
+        "description": "The type of chart.",
+        "enum": [
+          "area",
+          "bar",
+          "box",
+          "bubble",
+          "candlestick",
+          "funnel",
+          "heatmap",
+          "histogram",
+          "line",
+          "pie",
+          "radar",
+          "rings",
+          "rose",
+          "treemap"
+        ],
+        "enumDescriptions": {
+          "histogram": "Continuous values on the x-axis, which distinguishes it from bar.",
+          "rose": "In contrast to pie charts, the sectors are of equal angles and differ in how far each sector extends from the center of the circle."
+        }
+      },
+      "TopicKeywords": {
+        "type": "array",
+        "method": "generate",
+        "description": "Relevant topics associated with the chart, used for tagging.",
+        "items": {
+          "type": "string",
+          "method": "generate",
+          "examples": [
+            "Business and finance",
+            "Arts and culture",
+            "Education and academics"
+          ]
+        }
+      },
+      "DetailedDescription": {
+        "type": "string",
+        "method": "generate",
+        "description": "Detailed description of the chart or diagram, not leaving out any key information. Include numbers, trends, and other details."
+      },
+      "Summary": {
+        "type": "string",
+        "method": "generate",
+        "description": "Detailed summary of the chart, including highlights and takeaways."
+      },
+      "MarkdownDataTable": {
+        "type": "string",
+        "method": "generate",
+        "description": "Underlying data of the chart in tabular markdown format. Give markdown output with valid syntax and accurate numbers, and fill any uncertain values with empty cells. If not applicable, output an empty string."
+      },
+      "AxisTitles": {
+        "type": "object",
+        "method": "generate",
+        "description": "Titles of the x and y axes.",
+        "properties": {
+          "xAxisTitle": {
+            "type": "string",
+            "method": "generate"
+          },
+          "yAxisTitle": {
+            "type": "string",
+            "method": "generate"
+          }
+        }
+      },
+      "FootnotesAndAnnotations": {
+        "type": "string",
+        "method": "generate",
+        "description": "All footnotes and textual annotations in the chart or diagram."
+      }
+    },
+    "definitions": {}
+  }
+}
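
Note on the template above: its field schema nests in two ways, through `items` on array fields and through `properties` on object fields, and its one `classify` field carries an `enum` with optional `enumDescriptions`. The sketch below walks a trimmed copy of that structure and applies two consistency checks. Both checks are assumptions inferred from this template, not documented service validation rules, and `iter_fields` and `check_template` are hypothetical helper names.

```python
def iter_fields(fields, prefix=""):
    """Yield (path, definition) for each field, descending into items/properties."""
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        yield path, spec
        if spec.get("type") == "array" and "items" in spec:
            # Array elements are reported under "<field>[]".
            yield from iter_fields({"[]": spec["items"]}, prefix=path)
        if spec.get("type") == "object":
            yield from iter_fields(spec.get("properties", {}), prefix=path + ".")


def check_template(template):
    """Return a list of human-readable problems found in the field schema."""
    problems = []
    for path, spec in iter_fields(template["fieldSchema"]["fields"]):
        if spec.get("method") == "classify" and not spec.get("enum"):
            problems.append(f"{path}: classify field has no enum")
        extra = set(spec.get("enumDescriptions", {})) - set(spec.get("enum", []))
        for value in sorted(extra):
            problems.append(f"{path}: enumDescriptions key {value!r} not in enum")
    return problems


# Trimmed copy of the structure used by image_chart.json.
template = {
    "fieldSchema": {
        "fields": {
            "ChartType": {
                "type": "string",
                "method": "classify",
                "enum": ["bar", "histogram"],
                "enumDescriptions": {"histogram": "Continuous values on the x-axis."},
            },
            "TopicKeywords": {
                "type": "array",
                "method": "generate",
                "items": {"type": "string", "method": "generate"},
            },
            "AxisTitles": {
                "type": "object",
                "method": "generate",
                "properties": {"xAxisTitle": {"type": "string", "method": "generate"}},
            },
        }
    }
}

paths = [path for path, _ in iter_fields(template["fieldSchema"]["fields"])]
print(paths)
print(check_template(template))
```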

analyzer_templates/sample_invoice_analyzer.json → analyzer_templates/invoice.json

-1
@@ -1,5 +1,4 @@
 {
-  "analyzerId": "sample_invoice_analyzer",
   "description": "Sample invoice analyzer",
   "scenario": "document",
   "fieldSchema": {

analyzer_templates/sample_marketing_video_analyzer.json → analyzer_templates/marketing_video.json

-1
@@ -1,5 +1,4 @@
 {
-  "analyzerId": "sample_marketing_video_analyzer",
   "description": "Sample marketing video analyzer",
   "scenario": "videoShot",
   "config": {
+29
@@ -0,0 +1,29 @@
+{
+  "name": "PurchaseOrder_Extraction_Sample",
+  "description": "Extract useful information from purchase order",
+  "scenario": "document",
+  "fieldSchema": {
+    "fields": {
+      "PurchaseOrderNumber": {
+        "type": "string",
+        "method": "extract",
+        "description": ""
+      },
+      "PurchaseDate": {
+        "type": "date",
+        "method": "extract",
+        "description": ""
+      },
+      "TotalPayment": {
+        "type": "number",
+        "method": "extract",
+        "description": ""
+      },
+      "ShippedToAddress": {
+        "type": "string",
+        "method": "extract",
+        "description": ""
+      }
+    }
+  }
+}
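
Note on the missing `analyzerId`: the renamed templates above drop their hard-coded `analyzerId`, and this new template has none, so the ID presumably comes from the caller when the analyzer is created. The sketch below shows one way that pairing might look. The URL path layout is an assumption that should be confirmed against the service's REST reference, and `build_create_request`, the example endpoint, and the example ID are all hypothetical. No network call is made.

```python
import json
from urllib.parse import quote

API_VERSION = "2024-12-01-preview"  # the preview version named in the README


def build_create_request(endpoint, analyzer_id, template):
    """Return (url, body) for a hypothetical analyzer-creation request."""
    # Assumed path layout; not confirmed by this commit.
    url = (
        f"{endpoint.rstrip('/')}/contentunderstanding/analyzers/"
        f"{quote(analyzer_id, safe='')}?api-version={API_VERSION}"
    )
    # The template is sent unchanged; the ID lives only in the URL.
    return url, json.dumps(template)


# Trimmed copy of the purchase order template above.
template = {
    "name": "PurchaseOrder_Extraction_Sample",
    "scenario": "document",
    "fieldSchema": {
        "fields": {
            "PurchaseOrderNumber": {"type": "string", "method": "extract", "description": ""}
        }
    },
}

url, body = build_create_request("https://example.invalid/", "po-sample", template)
print(url)
```

Keeping the ID out of the template file lets the same template back several analyzers, which would fit the renames in this commit.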

analyzer_templates/sample_chart_analyzer.json

-17
This file was deleted.
Binary file not shown.
@@ -0,0 +1,87 @@
+{
+  "$schema": "https://schema.ai.azure.com/mmi/2024-12-01-preview/labels.json",
+  "fileId": "",
+  "fieldLabels": {
+    "PurchaseDate": {
+      "type": "date",
+      "valueDate": "2020-02-10",
+      "spans": [
+        {
+          "offset": 149,
+          "length": 10
+        }
+      ],
+      "confidence": 0.998,
+      "source": "D(1,1167,418,1318,418,1318,450,1167,450)",
+      "kind": "predicted"
+    },
+    "PurchaseOrderNumber": {
+      "type": "string",
+      "valueString": "9328424",
+      "spans": [
+        {
+          "offset": 178,
+          "length": 7
+        }
+      ],
+      "confidence": 0.998,
+      "source": "D(1,1281,459,1391,459,1391,490,1281,490)",
+      "kind": "predicted"
+    },
+    "ShippedToAddress": {
+      "type": "string",
+      "valueString": "932 N Cantaloupe Road Seattle, WA 38383",
+      "spans": [
+        {
+          "offset": 265,
+          "length": 3
+        },
+        {
+          "offset": 269,
+          "length": 1
+        },
+        {
+          "offset": 271,
+          "length": 10
+        },
+        {
+          "offset": 282,
+          "length": 4
+        },
+        {
+          "offset": 287,
+          "length": 8
+        },
+        {
+          "offset": 296,
+          "length": 2
+        },
+        {
+          "offset": 299,
+          "length": 5
+        }
+      ],
+      "confidence": 0.996,
+      "source": "D(1,278,683,322,683,322,715,278,714);D(1,333,683,348,683,348,715,333,715);D(1,356,683,497,683,497,715,356,715);D(1,508,683,570,683,570,714,507,715);D(1,275,721,368,721,368,750,275,750);D(1,377,721,420,720,420,750,377,750);D(1,429,720,506,720,506,750,429,750)",
+      "kind": "predicted"
+    },
+    "TotalPayment": {
+      "type": "number",
+      "valueNumber": 414,
+      "spans": [
+        {
+          "offset": 998,
+          "length": 7
+        }
+      ],
+      "confidence": 0.998,
+      "source": "D(1,1428,1669,1531,1669,1531,1699,1428,1699)",
+      "kind": "predicted"
+    }
+  },
+  "metadata": {
+    "displayName": "Form_3.jpg",
+    "createdDateTime": "2024-11-19T00:05:24.577Z",
+    "mimeType": "image/jpeg"
+  }
+}
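
Note on the label file above: each entry stores its value under a key that follows its `type` (`valueDate`, `valueString`, `valueNumber`), and its `source` string holds one `D(...)` group per span. The sketch below reads both. It assumes each `D(...)` group is a page number followed by x,y corner pairs, which is an interpretation of the data shown here rather than a documented format; `parse_source` and `label_value` are hypothetical helper names.

```python
import re

# One D(...) group: a leading integer, then a comma-separated run of numbers.
SOURCE_RE = re.compile(r"D\((\d+),([\d.,]+)\)")


def parse_source(source):
    """Split a source string into (page, [(x, y), ...]) regions."""
    regions = []
    for page, coords in SOURCE_RE.findall(source):
        numbers = [float(n) for n in coords.split(",")]
        points = list(zip(numbers[0::2], numbers[1::2]))
        regions.append((int(page), points))
    return regions


def label_value(label):
    """Read the typed value, e.g. type 'date' is stored under 'valueDate'."""
    return label.get("value" + label["type"].capitalize())


# PurchaseOrderNumber entry copied from the label file above.
label = {
    "type": "string",
    "valueString": "9328424",
    "spans": [{"offset": 178, "length": 7}],
    "source": "D(1,1281,459,1391,459,1391,490,1281,490)",
}

regions = parse_source(label["source"])
print(label_value(label), regions)
```

In this file the number of `D(...)` groups matches the number of spans for every field, and for `ShippedToAddress` the seven span lengths match the lengths of the seven whitespace-separated words in the value.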
