
Commit 8c9d4a1

Taxonomy Docs Update Part 2 (#40)
* docs(taxonomy): start reorganization of taxonomy section
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(trees): update the docs to explain taxonomy trees separately from the upstream structure
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(contributors): turn on contributor info and revision info, and add contributor wall
  Signed-off-by: Laura Santamaria <[email protected]>
* fix(emails): hide email addresses of contributors
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(moving): move more things around and parse to split upstream from taxonomy
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(glossary): add tooltip starter for overall glossary
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(code-copy): add code copy button, code line select, and code annotations
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(tooltips): add the last of the abbreviations to the global tooltips
  Signed-off-by: Laura Santamaria <[email protected]>
* docs(taxonomy): more updates
  Signed-off-by: Laura Santamaria <[email protected]>
* docs(knowledge): more cleanup of the knowledge areas
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(more): more doc updates
  Signed-off-by: Laura Santamaria <[email protected]>
* feat(examples): add final cleanup of examples
  Signed-off-by: Laura Santamaria <[email protected]>
* fix(tables): re-add content from #43
  Signed-off-by: Laura Santamaria <[email protected]>

---------

Signed-off-by: Laura Santamaria <[email protected]>
1 parent 366828e commit 8c9d4a1

19 files changed

+1746
-850
lines changed

.gitignore

Lines changed: 3 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -6,4 +6,6 @@ pyproject.toml
66
.python-version
77

88
# pycharm
9-
.idea
9+
.idea
10+
11+
.cache

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ Take a look at "lab-enhanced" models on the [InstructLab Hugging Face page](http
3838
- Approximately 60GB disk space (entire process)
3939

4040
!!! note
41-
Python 3.12 is currently not supported, because some dependencies don't work on Python 3.12, yet. As of now, we only support Python 3.10 or Python 3.11 (preferred). Any other Python version would install an older version of InstructLab (e.g. 0.17).
41+
Python 3.12 is currently not supported, because some dependencies don't work on Python 3.12, yet. As of now, we only support Python 3.10 or Python 3.11 (preferred). Any other Python version would install an older version of InstructLab (e.g., 0.17).
4242

4343
!!! tip
4444
When installing the `ilab` CLI on macOS, you may have to run the `xcode-select --install` command to install the required packages listed previously.

docs/resources/CONTRIBUTORS.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
# Contributors
22

3-
Lets celebrate our friends who help out with this site here!
3+
Let's celebrate our friends who help out with this site here!
4+
5+
{{ git_site_authors }}

docs/taxonomy/index.md

Lines changed: 52 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -3,38 +3,71 @@ title: Welcome to InstructLab's Taxonomy
33
description: The overview of 🐶 InstructLab's Taxonomy.
44
logo: images/ilab_dog.png
55
---
6-
## Welcome to the InstructLab Taxonomy
6+
# About the InstructLab Taxonomy
77

88
InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs). The "**lab**" in Instruct**Lab** 🐶 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081)[^1].
99

10-
The LAB method is driven by taxonomies, which are largely created manually and with care.
10+
The LAB method is driven by taxonomies, which are largely created manually and with care. Taxonomies allow you to create models tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 method.
1111

12-
The [instructlab/taxonomy](https://github.com/instructlab/taxonomy) repository contains a taxonomy tree that allows you to create models tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 method.
12+
The [instructlab/taxonomy](https://github.com/instructlab/taxonomy) repository contains a comprehensive taxonomy tree that we use to build the overall model. You are welcome to contribute to it!
1313

1414
[^1]: Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", [arXiv preprint arXiv: 2403.01081, 2024](https://arxiv.org/abs/2403.01081). (* denotes equal contributions)
1515

16-
## Choosing domains for the taxonomy
16+
### Skills and Knowledge
1717

18-
In general, we use the Dewey Decimal Classification (DDC) System to determine our domains (and subdomains) in the taxonomy. This [DDC SUMMARIES document](https://www.oclc.org/content/dam/oclc/dewey/resources/summaries/deweysummaries.pdf) is a great resource for determining where a topic might be classified.
18+
In a taxonomy, we have "skills," which are performative actions, and "knowledge," which is about answering questions that involve facts, data, or references. [Learn more about knowledge in the knowledge guide](knowledge/index.md), and [learn more about skills in the skills guide](skills/index.md).
1919

20-
If you are unsure where to put your knowledge or compositional skill, create a folder in the `miscellaneous_unknown` folder under the `knowledge` or `compositional_skills` folders.
20+
![An overview of the LAB alignment method. Starting from the taxonomy root, data are curated in each top-level group, and examples in the leaf nodes are used by the synthetic data generators to generate orders of magnitude more data for the phased-training step for fine-tuning.](../images/taxonomy_paper_diagram.png)
21+
22+
## Taxonomy trees
23+
24+
The overall structure of a taxonomy tree for InstructLab is a cascading file structure. The top-level directory is called the "root" of the tree. The resulting subdirectories are "branches" of the tree. Along a branch are more nested directories until we reach the final directory. That final directory along a branch is called the "leaf node," and it contains a YAML file named `qna.yaml` and, in the case of the [upstream taxonomy](#the-upstream-taxonomy), an `attribution.txt` file that holds citation information. The only required file in a leaf node directory for the InstructLab process is the `qna.yaml` file. You can learn more about the structure of the `qna.yaml` file in the [knowledge](knowledge/index.md) and [skill](skills/index.md) guides.
25+
26+
### Tree structure
27+
28+
The tree structure is important because it is used by the synthetic data generation process to relate chunks of knowledge together. Without it, training would not be as accurate.
29+
30+
The root of a taxonomy tree does not need to be the root of all knowledge. The only requirements for a taxonomy tree are that
31+
32+
* knowledge is within a `knowledge` directory
33+
* ungrounded compositional skills are within a `compositional_skills` directory
34+
* grounded compositional skills are within a `compositional_skills/grounded/` directory structure
35+
* the `knowledge` and `compositional_skills` directories are within a single root directory
36+
37+
This helps the synthetic data generation process and training process parse the files in the right order.
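
As a minimal sketch of the layout rules listed above, the following Python classifies a `qna.yaml` file by where it sits under the root. The helper names `classify_leaf` and `find_leaf_nodes` are hypothetical and chosen for illustration; they are not part of the `ilab` CLI or the taxonomy repository.

```python
from pathlib import Path


def classify_leaf(root: Path, qna_file: Path) -> str:
    """Classify a qna.yaml file by its position under the taxonomy root."""
    parts = qna_file.relative_to(root).parts
    if parts[0] == "knowledge":
        return "knowledge"
    if parts[0] == "compositional_skills":
        # Grounded skills live under compositional_skills/grounded/.
        if len(parts) > 1 and parts[1] == "grounded":
            return "grounded_skill"
        return "ungrounded_skill"
    raise ValueError(f"{qna_file} is outside knowledge/ and compositional_skills/")


def find_leaf_nodes(root: Path) -> dict[str, str]:
    """Map each leaf node directory (relative to root) to its kind."""
    return {
        qna.parent.relative_to(root).as_posix(): classify_leaf(root, qna)
        for qna in sorted(root.rglob("qna.yaml"))
    }
```

Calling `find_leaf_nodes(Path("taxonomy"))` on a local tree would raise a `ValueError` for any `qna.yaml` that sits outside the two expected top-level directories, which is one way to catch a misplaced leaf node before running data generation.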
38+
39+
### Sorting knowledge and skills
2140

22-
## Learning
41+
For each piece of knowledge, you should have a single `qna.yaml` file. For example, if you are fine-tuning a model to talk about cloud formations, you would make a leaf node directory for each type of cloud formation (e.g., `cumulonimbus`, `cirrus`, `cumulus`, `incus`, `lenticular`) and then have a `qna.yaml` file dedicated to each formation, with a document for each one. Do not lump all the cloud formations together into one YAML file with five or six documents as sources. The synthetic data generation process would not group the resulting data by cloud formation, so the resulting model might provide information about one cloud formation when asked about another.
2342

24-
Learn about the concepts of "skills" and "knowledge" in our [InstructLab Community Learning Guide](https://github.com/instructlab/community/blob/main/docs/README.md).
43+
The same thought applies to skills. A single skill should be in one leaf node directory, even if it is related to another skill. Do not create a `qna.yaml` file that has multiple skills in it.
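
The one-topic-per-leaf-node rule can be sketched in a few lines of Python. This is only an illustration: `plan_leaf_nodes` is a hypothetical helper, and the branch name `knowledge/clouds` is an assumed placeholder rather than a path from the upstream taxonomy.

```python
from pathlib import PurePosixPath


def plan_leaf_nodes(branch: str, topics: list[str]) -> list[str]:
    """Return one qna.yaml path per topic, each in its own leaf node directory."""
    return [str(PurePosixPath(branch) / topic / "qna.yaml") for topic in topics]


# Hypothetical branch name; each formation gets its own leaf node directory.
formations = ["cumulonimbus", "cirrus", "cumulus", "incus", "lenticular"]
for path in plan_leaf_nodes("knowledge/clouds", formations):
    print(path)
```

Each topic ends up with its own directory and its own `qna.yaml`, instead of one shared file covering all five formations.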
2544

26-
## Taxonomy tree layout
45+
### The upstream taxonomy
46+
47+
We have a [comprehensive taxonomy tree in the InstructLab project](https://github.com/instructlab/taxonomy) used to build the community model. We welcome contributions to that taxonomy. Here's more information on how that tree is structured.
48+
49+
#### Upstream tree domain classification
50+
51+
In general, we use the Dewey Decimal Classification (DDC) System to determine our domains (and subdomains) in the overall taxonomy. This [DDC SUMMARIES document](https://www.oclc.org/content/dam/oclc/dewey/resources/summaries/deweysummaries.pdf) is a great resource for determining where a topic might be classified.
52+
53+
Generally, expect that there may be several layers you need to add to the taxonomy tree when adding knowledge or skills. For example, if you were to write a knowledge submission about a constellation, you would need to add directories to the tree, primarily `astronomy/constellations/` inside the `knowledge/science/` branch, before you would add your submission. These are "branches", with your submission as a "leaf node." The taxonomy is very much like a tree.
54+
55+
If you are unsure where to put your knowledge or compositional skill, create a folder in the `miscellaneous_unknown` folder under the `knowledge` or `compositional_skills` folders.
56+
57+
#### Upstream taxonomy tree layout
2758

2859
The taxonomy tree is organized in a cascading directory structure. At the end of each branch, there is a YAML file (`qna.yaml`) that contains the examples for that domain along with any attribution files (`attribution.txt`). Maintainers can decide to change the names of the existing branches or to add new branches.
2960

3061
!!! important
3162
Folder names do not have spaces. Use underscores between words.
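
A small sketch of the naming rule, assuming you start from a human-readable label: the hypothetical helper `to_folder_name` joins the words with underscores. It leaves letter case untouched, since the rule above only addresses spaces.

```python
import re


def to_folder_name(label: str) -> str:
    """Join the words of a label with underscores so the folder name has no spaces."""
    return re.sub(r"\s+", "_", label.strip())
```

For instance, `to_folder_name("miscellaneous unknown")` yields `miscellaneous_unknown`.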
3263

33-
### Taxonomy diagram
64+
##### Taxonomy diagram
3465

3566
!!! note
3667
These diagrams show subsets of the taxonomy. They are not a complete representation.
3768

69+
In this diagram, a subset of a taxonomy for InstructLab demonstrates the branch-and-leaf-node structure.
70+
3871
```mermaid
3972
flowchart TD;
4073
na[not accepting contributions\n at this time]:::na
@@ -69,7 +102,7 @@ The taxonomy tree is organized in a cascading directory structure. At the end of
69102
classDef na fill:#EEE
70103
```
71104

72-
Below is an illustrative directory structure to show this layout:
105+
Here is an illustrative directory structure to show how the taxonomy is laid out in the practical sense:
73106

74107
```ascii
75108
.
@@ -93,17 +126,17 @@ Below is an illustrative directory structure to show this layout:
93126
└── qna.yaml
94127
attribution.txt
95128
```
96-
## Contribute knowledge and skills to the taxonomy
129+
#### Contribute knowledge and skills to the taxonomy
97130

98131
The ability to contribute to a Large Language Model (LLM) has been difficult in no small part because it is difficult to get access to the necessary compute infrastructure.
99132

100-
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch following InstructLab's progressive training on a regular basis. This enables fast iteration of the model(s), for the benefit of the open source community.
133+
The [taxonomy repository](https://github.com/instructlab/taxonomy) will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch following InstructLab's progressive training on a regular basis. This enables fast iteration of the model(s), for the benefit of the open source community.
101134

102135
By contributing your skills and knowledge to this repository, you will see your changes built into an LLM within days of your contribution rather than months or years! If you are working with a model and notice its knowledge or ability lacking, you can correct it by contributing knowledge or skills and check if it's improved after your changes are built.
103136

104137
While public contributions are welcome to help drive community progress, you can also fork this repository under [the Apache License, Version 2.0](../LICENSE), add your own internal skills, and train your own models internally. However, you might need your own access to significant compute infrastructure to perform sufficient retraining.
105138

106-
### Ways to contribute
139+
##### Ways to contribute
107140

108141
You can contribute to the taxonomy in the following two ways:
109142

@@ -112,16 +145,13 @@ You can contribute to the taxonomy in the following two ways:
112145

113146
For more information, see the [Ways of contributing to the taxonomy repository](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#ways-of-contributing-to-the-taxonomy-repository) documentation.
114147

115-
### How to contribute skills and knowledge
148+
##### How to contribute skills and knowledge
116149

117-
To contribute to this repo, you'll use the *Fork and Pull* model common in many open source repositories. You can add your skills and knowledge to the taxonomy in multiple ways; for additional information on how to make a contribution, see the [Documentation on contributing](../community/CONTRIBUTING.md). You can also use the following guides to help with contributing:
150+
To contribute to the repo, you'll use the *Fork and Pull* model common in many open source repositories. You can add your skills and knowledge to the taxonomy in multiple ways; for additional information on how to make a contribution, see the [Documentation on contributing](../community/CONTRIBUTING.md). You can also use the following guides to help with contributing:
118151

119152
- Contributing using the [GitHub webpage UI](https://github.com/instructlab/taxonomy/blob/main/docs/contributing_via_GH_UI.md).
120-
- Contributing knowledge to the taxonomy in the [Knowledge contribution guidelines](../taxonomy/knowledge/guide.md).
153+
- Contributing knowledge to the taxonomy in the [Knowledge contribution guidelines](../taxonomy/upstream/knowledge_contribution_details.md).
121154

122-
#### Why should I contribute?
155+
###### Why should I contribute?
123156

124-
This taxonomy repository will be used as the seed to synthesize the training
125-
data for InstructLab-trained models. We intend to retrain the model(s) using the main
126-
branch as often as possible (at least weekly).
127-
Fast iteration of the model(s) benefits the open source community and enables model developers who do not have access to the necessary compute infrastructure.
157+
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch as often as possible (at least weekly). Fast iteration of the model(s) benefits the open source community and enables model developers who do not have access to the necessary compute infrastructure.
