|
11 | 11 | "cell_type": "markdown",
|
12 | 12 | "metadata": {},
|
13 | 13 | "source": [
|
14 |
| - "**The notebooks in this directory were developed to demonstrate the \"Ten Rules for Reproducible Research with Jupyter Notebooks\". Throughout the notebooks we mention the rules we applied.**\n", |
| 14 | + "**The notebooks in this directory were developed to demonstrate the \"Ten Rules for Reproducible Research with Jupyter Notebooks\". Throughout the notebooks we refer to some of the rules we applied.**\n", |
15 | 15 | "\n",
|
16 | 16 | "**For example, this notebook demonstrates:**\n",
|
17 | 17 | "\n",
|
18 | 18 | "---\n",
|
19 | 19 | "\n",
|
20 | 20 | "**Rule 1: Tell a Story for a Specific Audience.** This notebook was developed for biologists to learn how to apply a simple machine learning model to protein sequences.\n",
|
21 | 21 | "\n",
|
22 |
| - "**Rule 3: Document the Entire Workflow.** This top-level notebook links to 3 notebooks that represent the steps of a workflow. This modularity makes it easy to replace one of the steps, for example, use a different method to calculate features or apply a different machine learning model.\n", |
| 22 | + "**Rule 3: Document the Entire Workflow.** This top-level notebook links to 4 notebooks that represent the steps of a workflow. This modularity makes it easy to replace one of the steps, for example, use a different method to calculate features or apply a different machine learning model.\n", |
23 | 23 | "\n",
|
24 | 24 | "---"
|
25 | 25 | ]
|
|
47 | 47 | "We can classify proteins into three major fold types based on their predominant secondary structure content\n",
|
48 | 48 | "* alpha: contains predominantly alpha helices\n",
|
49 | 49 | "* beta: contains predominantly beta sheets\n",
|
50 |
| - "* alpha+beta: contains both alpha helices and beta sheets" |
| 50 | + "* alpha+beta: contains alpha helices and beta sheets" |
51 | 51 | ]
|
52 | 52 | },
|
53 | 53 | {
|
54 | 54 | "cell_type": "markdown",
|
55 | 55 | "metadata": {},
|
56 | 56 | "source": [
|
57 | 57 | "## Goal\n",
|
58 |
| - "This notebook serves as an example of using machine learning techniques applied to protein sequences. The goal is to create a simple machine learning model to predict the fold type of a protein given its protein sequence. We train the model on a representative set of 3D structure from the Protein Data Bank.\n", |
| 58 | + "This notebook demonstrates how to create a reproducible record of building a machine learning model. We train a simple model to predict the fold class of a protein given its protein sequence using a representative set of 3D structures from the Protein Data Bank.\n", |
59 | 59 | "\n",
|
60 | 60 | "Run the following notebooks to work through this example."
|
61 | 61 | ]
|
|
71 | 71 | "cell_type": "markdown",
|
72 | 72 | "metadata": {},
|
73 | 73 | "source": [
|
74 |
| - "First, we need to create a dataset with protein secondary structure information obtained from 3D protein chains.\n", |
| 74 | + "First, we create a dataset with protein secondary structure information obtained from 3D protein chains.\n", |
75 | 75 | "\n",
|
76 | 76 | "Run the following notebook to extract secondary structure information from a representative set of protein chains downloaded from the RCSB Protein Data Bank and assign a fold type to each protein chain."
|
77 | 77 | ]
|
|
87 | 87 | "cell_type": "markdown",
|
88 | 88 | "metadata": {},
|
89 | 89 | "source": [
|
90 |
| - "The notebook saves the dataset in the file `secondaryStructure.json`." |
| 90 | + "The notebook saves the dataset in the file `./intermediate_data/foldClassification.json`." |
91 | 91 | ]
|
92 | 92 | },
|
93 | 93 | {
|
|
117 | 117 | "cell_type": "markdown",
|
118 | 118 | "metadata": {},
|
119 | 119 | "source": [
|
120 |
| - "The notebook saves the dateset in the file `features.json`." |
| 120 | + "This notebook saves the dataset with feature vectors in the file `./intermediate_data/features.json`." |
121 | 121 | ]
|
122 | 122 | },
|
123 | 123 | {
|
|
131 | 131 | "cell_type": "markdown",
|
132 | 132 | "metadata": {},
|
133 | 133 | "source": [
|
134 |
| - "Next, we fit a 3-state classification model using the feature vectors as inputs and the known fold types from the Protein Data Bank dataset.\n", |
| 134 | + "Next, we fit a 3-state classification model using the feature vectors and the given fold classification from the Protein Data Bank dataset.\n", |
135 | 135 | "\n",
|
136 | 136 | "Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set."
|
137 | 137 | ]
|
|
143 | 143 | "[3-FitModel.ipynb](./3-FitModel.ipynb)"
|
144 | 144 | ]
|
145 | 145 | },
|
| 146 | + { |
| 147 | + "cell_type": "markdown", |
| 148 | + "metadata": {}, |
| 149 | + "source": [ |
| 150 | + "This notebook saves the classification model in the file `./intermediate_data/classifier`." |
| 151 | + ] |
| 152 | + }, |
146 | 153 | {
|
147 | 154 | "cell_type": "markdown",
|
148 | 155 | "metadata": {},
|
|
154 | 161 | "cell_type": "markdown",
|
155 | 162 | "metadata": {},
|
156 | 163 | "source": [
|
157 |
| - "Finally, we use the Word2Vec model and the trained classifier to predict the fold class from a protein sequence." |
| 164 | + "Finally, we use the trained classifier to predict the fold class from a protein sequence." |
158 | 165 | ]
|
159 | 166 | },
|
160 | 167 | {
|
|
184 | 191 | },
|
185 | 192 | {
|
186 | 193 | "cell_type": "code",
|
187 |
| - "execution_count": 2, |
| 194 | + "execution_count": 1, |
188 | 195 | "metadata": {},
|
189 | 196 | "outputs": [
|
190 | 197 | {
|
191 | 198 | "name": "stdout",
|
192 | 199 | "output_type": "stream",
|
193 | 200 | "text": [
|
194 |
| - "The watermark extension is already loaded. To reload it, use:\n", |
195 |
| - " %reload_ext watermark\n", |
196 | 201 | "CPython 3.6.3\n",
|
197 | 202 | "IPython 6.3.1\n",
|
198 | 203 | "\n",
|
199 |
| - "gensim 3.6.0\n", |
| 204 | + "ipywidgets 7.4.0\n", |
200 | 205 | "matplotlib 2.2.2\n",
|
201 | 206 | "numpy 1.14.5\n",
|
202 | 207 | "pandas 0.22.0\n",
|
|
214 | 219 | ],
|
215 | 220 | "source": [
|
216 | 221 | "%load_ext watermark\n",
|
217 |
| - "%watermark -v -m -p gensim,matplotlib,numpy,pandas,sklearn" |
| 222 | + "%watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn" |
| 223 | + ] |
| 224 | + }, |
| 225 | + { |
| 226 | + "cell_type": "markdown", |
| 227 | + "metadata": {}, |
| 228 | + "source": [ |
| 229 | + "---\n", |
| 230 | + "\n", |
| 231 | + "**Authors:** Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018\n", |
| 232 | + "\n", |
| 233 | + "---" |
218 | 234 | ]
|
219 | 235 | }
|
220 | 236 | ],
|
|