docs/taxonomy/knowledge/contribution_details.md
10 additions & 6 deletions
@@ -67,28 +67,32 @@ For knowledge submissions, we need a `qna.yaml` file and an `attribution.txt` fi
67
67
68
68
For the current version of the taxonomy, version 3, here are the available fields:
69
69
70
+
!!! note
71
+
For context, questions, and answers, tokens roughly correspond to "words," but the limits are measured specifically in tokens, not words.
72
+
70
73
Key | Required | Type | Constraints | Value | Notes
71
74
--|--|--|--|--|--
72
75
`version` | Y | integer | - | `3` | The taxonomy schema version used in the `qna.yaml` file. Defined in [instructlab/schema](https://github.com/instructlab/schema)
73
76
`created_by` | Y | string | - | Your GitHub username | -
74
77
`domain` | Y | string | - | Knowledge sub-category | The knowledge domain, which is used in prompts to the teacher model during synthetic data generation. The domain should be brief, such as the title of a textbook chapter or section.
75
78
`seed_examples` | Y | array | at least 5 sets | null | This is a collection of questions and answers with context from the knowledge document that InstructLab uses to generate data synthetically.
76
-
`context` | Y | string | < 500 words | A chunk of information from the original knowledge document | This should be a copy-paste from the Markdown version of your document
79
+
`context` | Y | string | < 500 tokens | A chunk of the document that shows off its different, **unique** content to help guide the teacher model. If the document contains tables or other non-text content in addition to plain text, be sure to include chunks of that, too. | This should be a copy-paste from the Markdown version of your document
77
80
`questions_and_answers` | Y | array | at least 3 pairs per context | null | This is a collection of questions and answers.
78
-
`question` | Y | string | > 250 words | A question related to the context | Questions are things you'd expect someone to ask the model based on the context given. This will be used for synthetic data generation.
79
-
`answer` | Y | string | > 250 words | An answer for the question | Answers are what you'd like the model to give as an answer. It will not be an exact answer the model always gives.
80
-
`document_outline` | Y | string | - | A brief summary of the document | -
81
+
`question` | Y | string | > 250 tokens | A question grounded in the relevant context | Questions are things you'd expect someone to ask the model based on the context given. This will be used for synthetic data generation.
82
+
`answer` | Y | string | > 250 tokens | An answer to the question, longer than a one-word answer. | Answers are what you'd like the model to give as an answer. The model will not always give this exact answer.
83
+
`document_outline` | Y | string | - | Context specific to each document chunk; make this as **specific** as you possibly can. | -
81
84
`document` | Y | object | - | null | The collection of data for the knowledge document.
82
85
`repo` | Y | string | a git URL | The URL (with a `.git` suffix) that identifies your git repo where you've stored your knowledge documents | -
83
86
`commit` | Y | string | full commit hash | A full SHA-1 commit hash that corresponds to the document in the repo | The document must be present in the repo at exactly this commit so that the system can find it.
84
87
`patterns` | Y | array | `*.md`, `*.pdf` | A list of glob patterns specifying the files in the repo. | Any glob pattern that starts with `*` must be quoted due to YAML rules. Currently, the system accepts `.md` and `.pdf` files.
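
The constraints on the `document` object in the table above can be sketched as a small check. This is a minimal illustration, not InstructLab's own validation: `check_document` is a hypothetical helper, and it assumes the `qna.yaml` file has already been parsed into a Python dictionary by a YAML loader of your choice.

```python
import re

# Hypothetical helper, not part of InstructLab: checks the `document` object
# of an already-parsed qna.yaml against the constraints listed in the table.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def check_document(document):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    # `repo` must be a git URL with a `.git` suffix.
    repo = document.get("repo")
    if not isinstance(repo, str) or not repo.endswith(".git"):
        problems.append("repo must be a git URL with a .git suffix")

    # `commit` must be a full SHA-1 hash: 40 hexadecimal characters.
    commit = document.get("commit")
    if not isinstance(commit, str) or not FULL_SHA1.fullmatch(commit.lower()):
        problems.append("commit must be a full 40-character SHA-1 hash")

    # `patterns` must list globs for the accepted file types (.md and .pdf).
    patterns = document.get("patterns")
    if not isinstance(patterns, list) or not patterns:
        problems.append("patterns must be a non-empty list of glob patterns")
    else:
        for pattern in patterns:
            if not str(pattern).endswith((".md", ".pdf")):
                problems.append(f"pattern {pattern!r} does not target .md or .pdf files")

    return problems
```

The quoting rule for patterns that start with `*` applies to the YAML text itself, so it cannot be checked after parsing; an unquoted `*.md` fails at load time instead.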
85
88
86
89
!!! important
87
-
There must be at least 5 sets of questions and answers with context in every `qna.yaml` file.
90
+
There must be at least 5 sets of context in every `qna.yaml` file, each with at least 3 question-and-answer pairs. The "context blocks" should also be as diverse and unique as possible. The goal is to include as much varied
91
+
information as possible, so that as the teacher LLM reads through the document it gets "inspired" by the different content.
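
The counting rules above (at least 5 seed examples, each with context and at least 3 question-and-answer pairs) can be sketched as follows. `check_seed_examples` is a hypothetical helper, not InstructLab tooling, and it assumes the `qna.yaml` content is already loaded into a dictionary. It checks structure only: the token limits are not checked, because counting tokens depends on a tokenizer that this sketch does not assume.

```python
MIN_SEED_EXAMPLES = 5  # "at least 5 sets" in the fields table
MIN_QNA_PAIRS = 3      # "at least 3 pairs per context" in the fields table


def check_seed_examples(qna):
    """Return a list of structural problems found in a parsed qna.yaml dict."""
    problems = []
    seed_examples = qna.get("seed_examples") or []

    if len(seed_examples) < MIN_SEED_EXAMPLES:
        problems.append(
            f"found {len(seed_examples)} seed examples, need at least {MIN_SEED_EXAMPLES}"
        )

    for index, example in enumerate(seed_examples, start=1):
        # Every seed example needs a non-empty context chunk.
        if not example.get("context"):
            problems.append(f"seed example {index}: missing context")

        # Every context needs at least three question-and-answer pairs.
        pairs = example.get("questions_and_answers") or []
        if len(pairs) < MIN_QNA_PAIRS:
            problems.append(
                f"seed example {index}: found {len(pairs)} pairs, need at least {MIN_QNA_PAIRS}"
            )

        # Report an incomplete pair once per seed example.
        for pair in pairs:
            if not pair.get("question") or not pair.get("answer"):
                problems.append(
                    f"seed example {index}: every pair needs a question and an answer"
                )
                break

    return problems
```

An empty list means the structure satisfies the counts; it says nothing about how diverse the context blocks are, which still needs a human reviewer.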
88
92
89
93
#### An example file
90
94
91
-
To build a strong taxonomy,
95
+
To build a strong taxonomy,
92
96
93
97
## Create a pull request in the taxonomy repository