diff --git a/Examples.ipynb b/Examples.ipynb index 273a252..2eb5866 100644 --- a/Examples.ipynb +++ b/Examples.ipynb @@ -17,22 +17,27 @@ "View instructions provided in the main README.md available at https://github.com/bartongroup/ProteoFAV" ] }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import proteofav" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Example Usage" + "## Configuration" ] }, { - "cell_type": "code", - "execution_count": 1, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# ipython magics to keep reloading the project (during testing)\n", - "%load_ext autoreload\n", - "%autoreload 2" + "ProteoFAV implements two approaches to handling datasets. One can fetch a few files on the fly using the convenience functions provided. For large-scale studies, however, it is preferable to use a local source for the various data files used, such as the mmCIF files for three-dimensional protein structures." 
] }, { @@ -59,9 +64,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -218,6 +221,13 @@ "print(mmcif_bio.columns)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a formal description of each column, please see http://mmcif.wwpdb.org/" + ] + }, { "cell_type": "code", "execution_count": 6, @@ -259,7 +269,8 @@ } ], "source": [ - "# PDB Lines are parsed so that column names mimic those of the mmCIF format\n", + "# Column names of a parsed PDB file mimic those of the mmCIF format\n", + "# Please prefer processing mmCIF files instead of PDB files, which are deprecated\n", "pdb = PDB.read(filename=out_pdb)\n", "print(pdb.head())\n", "print(pdb.columns)" @@ -269,15 +280,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Dowloading a SIFTS xml record for obtaining PDB-UniProt mapping" + "### Downloading a SIFTS xml record for obtaining PDB-UniProt mapping" ] }, { "cell_type": "code", "execution_count": 7, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from proteofav.sifts import SIFTS\n", @@ -357,6 +366,21 @@ "print(sifts.head())" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The SIFTS record also contains mappings to many other databases, such as:\n", + "- CATH\n", + "- SCOP\n", + "- Pfam\n", + "\n", + "Bear in mind that SIFTS mapping occurs at the residue level, but also at the domain level. \n", + "The default action is to load the residue mapping.\n", + "\n", + "Also see the *PDB_Annotation* which flags several types of annotation at residue level, for example whether a given UniProt residue was observed." 
+ ] + }, { "cell_type": "markdown", "metadata": {}, @@ -457,54 +481,47 @@ "name": "stdout", "output_type": "stream", "text": [ - " validation_rama validation_ligRSRnbrMean validation_chain validation_ent \\\n", - "0 NaN NaN A 1 \n", - "1 Favored NaN A 1 \n", - "2 Favored NaN A 1 \n", - "3 Favored NaN A 1 \n", - "4 Favored NaN A 1 \n", - "\n", - " validation_altcode validation_rota validation_icode \\\n", - "0 . t ? \n", - "1 . Cg_exo ? \n", - "2 . t90 ? \n", - "3 . p90 ? \n", - "4 . Cg_endo ? \n", - "\n", - " validation_ligRSRnbrStdev validation_ligRSRnumnbrs validation_resname \\\n", - "0 NaN NaN VAL \n", - "1 NaN NaN PRO \n", - "2 NaN NaN TRP \n", - "3 NaN NaN PHE \n", - "4 NaN NaN PRO \n", - "\n", - " ... validation_lig_rsrz_nbr_id validation_ligRSRZ \\\n", - "0 ... NaN NaN \n", - "1 ... NaN NaN \n", - "2 ... NaN NaN \n", - "3 ... NaN NaN \n", - "4 ... NaN NaN \n", - "\n", - " validation_flippable-sidechain validation_model validation_said \\\n", - "0 NaN 1 A \n", - "1 NaN 1 A \n", - "2 NaN 1 A \n", - "3 NaN 1 A \n", - "4 NaN 1 A \n", - "\n", - " validation_rsrz validation_psi validation_mogul-ignore validation_owab \\\n", - "0 -0.160 NaN NaN 52.97 \n", - "1 -0.274 149.9 NaN 28.84 \n", - "2 -0.874 139.2 NaN 33.47 \n", - "3 -0.308 150.9 NaN 39.98 \n", - "4 -0.204 130.9 NaN 26.14 \n", - "\n", - " validation_rscc \n", - "0 0.896 \n", - "1 0.960 \n", - "2 0.961 \n", - "3 0.920 \n", - "4 0.973 \n", + " validation_rscc validation_rama validation_icode validation_ligRSRZ \\\n", + "0 0.896 NaN ? NaN \n", + "1 0.960 Favored ? NaN \n", + "2 0.961 Favored ? NaN \n", + "3 0.920 Favored ? NaN \n", + "4 0.973 Favored ? NaN \n", + "\n", + " validation_ligRSRnbrMean validation_flippable-sidechain validation_psi \\\n", + "0 NaN NaN NaN \n", + "1 NaN NaN 149.9 \n", + "2 NaN NaN 139.2 \n", + "3 NaN NaN 150.9 \n", + "4 NaN NaN 130.9 \n", + "\n", + " validation_rsr validation_owab validation_ligRSRnumnbrs ... \\\n", + "0 0.233 52.97 NaN ... \n", + "1 0.190 28.84 NaN ... 
 \n", + "2 0.154 33.47 NaN ... \n", + "3 0.229 39.98 NaN ... \n", + "4 0.197 26.14 NaN ... \n", + "\n", + " validation_chain validation_phi validation_said validation_rsrz \\\n", + "0 A NaN A -0.160 \n", + "1 A -51.6 A -0.274 \n", + "2 A -81.0 A -0.874 \n", + "3 A -138.9 A -0.308 \n", + "4 A -65.3 A -0.204 \n", + "\n", + " validation_seq validation_ligRSRnbrStdev validation_altcode \\\n", + "0 1 NaN . \n", + "1 2 NaN . \n", + "2 3 NaN . \n", + "3 4 NaN . \n", + "4 5 NaN . \n", + "\n", + " validation_lig_rsrz_nbr_id validation_NatomsEDS validation_resnum \n", + "0 NaN 7 118 \n", + "1 NaN 7 119 \n", + "2 NaN 14 120 \n", + "3 NaN 11 121 \n", + "4 NaN 7 122 \n", "\n", "[5 rows x 27 columns]\n" ] @@ -515,6 +532,13 @@ "print(validation.head())" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The PDB validation record is convenient for filtering a protein structure before analysis." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -522,6 +546,13 @@ "### Select only CA residues in for a single chain" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A protein structure is a hierarchical data structure (see http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ), so ProteoFAV transforms it to obtain the data in tabular format. For example, for use cases that require one residue per row, the residue's three-dimensional coordinates can be represented by those of its Cα atom. 
Other filtering options are available through *filter_structures*" + ] }, { "cell_type": "code", "execution_count": 13, @@ -576,7 +607,7 @@ "mmcif_sel = filter_structures(mmcif, excluded_cols=None,\n", " models='first', chains='A', res=None, res_full=None,\n", " comps=None, atoms='CA', lines=None, category='auth',\n", - " residue_agg=False, agg_method='centroid',\n", + " residue_agg=False, \n", " add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n", " remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n", "print(mmcif_sel.head())" @@ -586,7 +617,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Aggregating atoms residue-by-residue" + "### Aggregating atoms residue-by-residue\n", + "The three-dimensional coordinates of all atoms in a residue can be represented by the residue's centroid" ] }, { @@ -646,9 +678,7 @@ }, { "cell_type": "markdown", - "metadata": { - "collapsed": true - }, + "metadata": {}, "source": [ "### Write a PDB-formatted file from a mmCIF structure" ] @@ -740,7 +770,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Downloading a sequence Annotation (GFF) from UniProt" + "### Downloading a sequence Annotation (GFF) from UniProt\n", + "UniProt provides extensive, high-quality annotation for residues in proteins" ] }, { @@ -762,7 +793,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Loading the sequence Annotation" + "### Loading the sequence Annotation\n", + "Note also that GFF files, although tabular, contain an extra level of nesting in the `GROUP` column. 
ProteoFAV tries to deconvolute this information" ] }, { @@ -813,25 +845,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Downloading variants based on the UniProt ID" + "### Downloading variants based on the UniProt ID\n", + "We could fetch genetic variants from UniProt and Ensembl with:\n", + "\n", + "```python\n", + "Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n", + " synonymous=False, uniprot_vars=True,\n", + " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n", + "```\n", + "\n", + "but `select_variants` handles merging of Ensembl vars for us" ] }, { "cell_type": "code", "execution_count": 19, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from proteofav.variants import Variants\n", "\n", - "# we could simply fetch the variants from UniProt and Ensembl\n", - "# Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n", - "# synonymous=False, uniprot_vars=True,\n", - "# ensembl_germline_vars=True, ensembl_somatic_vars=True)\n", - "\n", - "# but `select_variants` handles merging of Ensembl vars for us\n", "uniprot, ensembl = Variants.select(identifier=uniprot_ids[0], id_source='uniprot', \n", " synonymous=False, uniprot_vars=True,\n", " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n" @@ -854,79 +887,79 @@ "output_type": "stream", "text": [ " accession alternativeSequence association_description association_disease \\\n", - "0 P00439 C mild True \n", - "1 P00439 N NaN True \n", - "2 P00439 del NaN NaN \n", - "3 P00439 F NaN True \n", - "4 P00439 L haplotypes 1,4 True \n", + "0 P00439 A NaN True \n", + "1 P00439 L haplotypes 1,4 True \n", + "2 P00439 L NaN True \n", + "3 P00439 S haplotype 36 True \n", + "4 P00439 V NaN True \n", "\n", " association_evidences_code \\\n", "0 ECO:0000269 \n", "1 ECO:0000269 \n", "2 NaN \n", - "3 NaN \n", + "3 ECO:0000269 \n", "4 ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl \\\n", - "0 
http://europepmc.org/abstract/MED/9048935 \n", - "1 [http://europepmc.org/abstract/MED/12501224, h... \n", + "0 [http://europepmc.org/abstract/MED/22513348, h... \n", + "1 [http://europepmc.org/abstract/MED/22513348, h... \n", "2 NaN \n", - "3 NaN \n", - "4 [http://europepmc.org/abstract/MED/12501224, h... \n", + "3 http://europepmc.org/abstract/MED/2014802 \n", + "4 [http://europepmc.org/abstract/MED/8889590, ht... \n", "\n", " association_evidences_source_id \\\n", - "0 9048935 \n", - "1 [12501224, 22513348, 1358789] \n", + "0 [22513348, 8889590, 8088845, 12501224] \n", + "1 [22513348, 1672290, 8889590, 12501224, 1672294] \n", "2 NaN \n", - "3 NaN \n", - "4 [12501224, 1672290, 8889590, 22513348, 1672294] \n", + "3 2014802 \n", + "4 [8889590, 12501224, 22513348, 23792259] \n", "\n", " association_evidences_source_name \\\n", "0 PubMed \n", "1 PubMed \n", "2 NaN \n", - "3 NaN \n", + "3 PubMed \n", "4 PubMed \n", "\n", " association_evidences_source_url \\\n", - "0 http://www.ncbi.nlm.nih.gov/pubmed/9048935 \n", - "1 [http://www.ncbi.nlm.nih.gov/pubmed/12501224, ... \n", + "0 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", + "1 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", "2 NaN \n", - "3 NaN \n", - "4 [http://www.ncbi.nlm.nih.gov/pubmed/12501224, ... \n", + "3 http://www.ncbi.nlm.nih.gov/pubmed/2014802 \n", + "4 [http://www.ncbi.nlm.nih.gov/pubmed/8889590, h... \n", "\n", " association_name \\\n", - "0 Phenylketonuria (PKU) \n", - "1 [Phenylketonuria (PKU), Hyperphenylalaninemia ... \n", - "2 NaN \n", + "0 [Phenylketonuria (PKU), Hyperphenylalaninemia ... \n", + "1 Phenylketonuria (PKU) \n", + "2 Hyperphenylalaninemia (HPA) \n", "3 Phenylketonuria (PKU) \n", "4 Phenylketonuria (PKU) \n", "\n", " ... siftPrediction siftScore \\\n", - "0 ... deleterious 0 \n", - "1 ... tolerated 1 \n", + "0 ... tolerated 0.11 \n", + "1 ... deleterious 0 \n", "2 ... NaN NaN \n", "3 ... NaN NaN \n", - "4 ... 
deleterious 0 \n", - "\n", - " somaticStatus sourceType taxid type wildType xrefs_id \\\n", - "0 0 uniprot 9606 VARIANT Y rs62514927 \n", - "1 0 uniprot 9606 VARIANT D rs62644499 \n", - "2 0 uniprot 9606 VARIANT d NaN \n", - "3 0 uniprot 9606 VARIANT S rs62508577 \n", - "4 0 uniprot 9606 VARIANT P rs5030851 \n", - "\n", - " xrefs_name \\\n", - "0 [dbSNP, Ensembl, ExAC] \n", - "1 [dbSNP, Ensembl, ExAC] \n", - "2 NaN \n", - "3 [dbSNP, Ensembl] \n", - "4 [dbSNP, Ensembl, ESP, ExAC] \n", + "4 ... tolerated 0.06 \n", + "\n", + " somaticStatus sourceType taxid type wildType xrefs_id \\\n", + "0 0 mixed 9606 VARIANT V rs796052017 \n", + "1 0 mixed 9606 VARIANT P rs5030851 \n", + "2 0 uniprot 9606 VARIANT Q rs199475662 \n", + "3 0 uniprot 9606 VARIANT L rs62642930 \n", + "4 0 mixed 9606 VARIANT A rs5030857 \n", + "\n", + " xrefs_name \\\n", + "0 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", + "1 [dbSNP, Ensembl, ESP, ExAC] \n", + "2 [dbSNP, Ensembl] \n", + "3 [dbSNP, Ensembl] \n", + "4 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", "\n", " xrefs_url \n", "0 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "1 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", - "2 NaN \n", + "2 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "3 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "4 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... 
\n", "\n", @@ -948,18 +981,18 @@ "output_type": "stream", "text": [ " Parent allele begin clinical_significance codons \\\n", - "0 ENST00000553106 HGMD_MUTATION 257 [] \n", + "0 ENST00000553106 HGMD_MUTATION 377 [] \n", "1 ENST00000553106 C/T 75 [] Gat/Aat \n", - "2 ENST00000553106 HGMD_MUTATION 47 [] \n", - "3 ENST00000553106 HGMD_MUTATION 100 [] \n", - "4 ENST00000553106 HGMD_MUTATION 261 [] \n", + "2 ENST00000553106 HGMD_MUTATION 300 [] \n", + "3 ENST00000553106 HGMD_MUTATION 245 [] \n", + "4 ENST00000553106 HGMD_MUTATION 415 [] \n", "\n", " consequenceType end feature_type frequency \\\n", - "0 coding_sequence_variant 257 transcript_variation NaN \n", + "0 coding_sequence_variant 377 transcript_variation NaN \n", "1 missense_variant 75 transcript_variation NaN \n", - "2 coding_sequence_variant 47 transcript_variation NaN \n", - "3 coding_sequence_variant 100 transcript_variation NaN \n", - "4 coding_sequence_variant 261 transcript_variation NaN \n", + "2 coding_sequence_variant 300 transcript_variation NaN \n", + "3 coding_sequence_variant 245 transcript_variation NaN \n", + "4 coding_sequence_variant 415 transcript_variation NaN \n", "\n", " polyphenScore residues seq_region_name siftScore translation \\\n", "0 NaN ENSP00000448059 NaN ENSP00000448059 \n", @@ -969,11 +1002,11 @@ "4 NaN ENSP00000448059 NaN ENSP00000448059 \n", "\n", " xrefs_id \n", - "0 CM010966 \n", + "0 CD011183 \n", "1 rs767453024 \n", - "2 CM941126 \n", - "3 CM992944 \n", - "4 CM910287 \n" + "2 CM950893 \n", + "3 CM941133 \n", + "4 CM920564 \n" ] } ], @@ -985,7 +1018,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Merging down the two Variants tables" + "### Merging down the two Variants tables\n", + "For merging variants from the UniProt and Ensembl" ] }, { @@ -997,30 +1031,30 @@ "name": "stdout", "output_type": "stream", "text": [ - " Parent accession allele alternativeSequence \\\n", - "0 NaN P00439 NaN del \n", - "1 NaN P00439 NaN P \n", - "2 NaN P00439 NaN Y \n", - "3 
ENST00000553106 P00439 A/C/G * \n", - "4 NaN P00439 NaN L \n", - "\n", - " association_description association_disease association_evidences_code \\\n", - "0 NaN NaN NaN \n", - "1 NaN True ECO:0000269 \n", - "2 NaN NaN NaN \n", - "3 NaN NaN NaN \n", - "4 NaN True ECO:0000269 \n", + " Parent accession allele alternativeSequence association_description \\\n", + "0 NaN P00439 NaN del NaN \n", + "1 NaN P00439 NaN del NaN \n", + "2 NaN P00439 NaN K NaN \n", + "3 NaN P00439 NaN del NaN \n", + "4 NaN P00439 NaN L NaN \n", + "\n", + " association_disease association_evidences_code \\\n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 NaN NaN \n", + "3 NaN NaN \n", + "4 True ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl association_evidences_source_id \\\n", "0 NaN NaN \n", - "1 http://europepmc.org/abstract/MED/22513348 22513348 \n", + "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 http://europepmc.org/abstract/MED/23792259 23792259 \n", "\n", " association_evidences_source_name \\\n", "0 NaN \n", - "1 PubMed \n", + "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 PubMed \n", @@ -1028,23 +1062,23 @@ " ... siftScore somaticStatus \\\n", "0 ... NaN 0.0 \n", "1 ... NaN 0.0 \n", - "2 ... 0.01 0.0 \n", + "2 ... 0 0.0 \n", "3 ... NaN 0.0 \n", "4 ... 
NaN 0.0 \n", "\n", - " sourceType taxid translation type wildType xrefs_id \\\n", - "0 uniprot 9606.0 NaN VARIANT d NaN \n", - "1 uniprot 9606.0 NaN VARIANT L NaN \n", - "2 large_scale_study 9606.0 NaN VARIANT F COSM1510886 \n", - "3 large_scale_study 9606.0 ENSP00000448059 VARIANT Y rs62507332 \n", - "4 uniprot 9606.0 NaN VARIANT F NaN \n", + " sourceType taxid translation type wildType xrefs_id \\\n", + "0 uniprot 9606.0 NaN VARIANT L NaN \n", + "1 uniprot 9606.0 NaN VARIANT Y NaN \n", + "2 large_scale_study 9606.0 NaN VARIANT T COSM546084 \n", + "3 uniprot 9606.0 NaN VARIANT L NaN \n", + "4 uniprot 9606.0 NaN VARIANT F NaN \n", "\n", - " xrefs_name xrefs_url \n", - "0 NaN NaN \n", - "1 NaN NaN \n", - "2 cosmic curated http://cancer.sanger.ac.uk/cosmic/mutation/ove... \n", - "3 [1000Genomes, ExAC] [http://www.ensembl.org/Homo_sapiens/Variation... \n", - "4 NaN NaN \n", + " xrefs_name xrefs_url \n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 cosmic curated http://cancer.sanger.ac.uk/cosmic/mutation/ove... \n", + "3 NaN NaN \n", + "4 NaN NaN \n", "\n", "[5 rows x 50 columns]\n" ] @@ -1103,7 +1137,7 @@ "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", - "1 P rs398123292 (ExAC, 1000Genomes) \n", + "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", @@ -1111,7 +1145,7 @@ " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", - "2 (http://exac.broadinstitute.org/awesome?query=... \n", + "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... 
 \n", "4 NaN \n", "\n", @@ -1209,7 +1243,7 @@ "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", - "1 P rs398123292 (ExAC, 1000Genomes) \n", + "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", @@ -1217,7 +1251,7 @@ " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", - "2 (http://exac.broadinstitute.org/awesome?query=... \n", + "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... \n", "4 NaN \n", "\n", @@ -1234,13 +1268,814 @@ " residue_agg='centroid', overwrite=False)\n", "print(table.head())" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use case 1: characterising the structural properties of protein post-translationally modified sites (or any other site)\n", + "\n", + "One can use ProteoFAV for high-throughput structural characterization of binding sites, such as in Britto-Borges and Barton, 2017.\n", + "\n", + "For example, the cAMP-dependent protein kinase catalytic subunit alpha (PKAα) is a small protein kinase that is critical to homeostatic processes in human tissue and to the stress response in lower organisms [UniProt:P17612](http://www.uniprot.org/uniprot/P17612). Accordingly, the function of the protein has been extensively studied, and its three-dimensional structure has been determined with high sequence coverage and resolution.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "uniprot_id = 'P17612'\n", + "gff_path = os.path.join(out_dir, uniprot_id + \".gff\")\n", + "\n", + "Annotation.download(\n", + " identifier=uniprot_id, \n", + " filename=gff_path)\n", + "P17612_annotation = Annotation.read(filename=gff_path)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
|  | NAME | SOURCE | TYPE | START | END | SCORE | STRAND | FRAME | GROUP | Dbxref | ID | Note | Ontology_term | evidence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | P17612 | UniProtKB | Modified residue | 11 | 11 | . | . | . | Note=Phosphoserine%3B by autocatalysis;Ontolog... | NaN | NaN | [Phosphoserine; by autocatalysis] | [ECO:0000250] | [ECO:0000250\|UniProtKB:P05132] |
| 11 | P17612 | UniProtKB | Modified residue | 49 | 49 | . | . | . | Note=Phosphothreonine;Ontology_term=ECO:000024... | [PMID:18691976] | NaN | [Phosphothreonine] | [ECO:0000244] | [ECO:0000244\|PubMed:18691976] |
| 12 | P17612 | UniProtKB | Modified residue | 140 | 140 | . | . | . | Note=Phosphoserine;Ontology_term=ECO:0000250;e... | NaN | NaN | [Phosphoserine] | [ECO:0000250] | [ECO:0000250\|UniProtKB:P05132] |
| 13 | P17612 | UniProtKB | Modified residue | 196 | 196 | . | . | . | Note=Phosphothreonine;Ontology_term=ECO:000026... | [PMID:12372837] | NaN | [Phosphothreonine] | [ECO:0000269] | [ECO:0000269\|PubMed:12372837] |
| 14 | P17612 | UniProtKB | Modified residue | 198 | 198 | . | . | . | Note=Phosphothreonine%3B by PDPK1;Ontology_ter... | [PMID:12372837,PMID:16765046,PMID:20137943,PMI... | NaN | [Phosphothreonine; by PDPK1] | [ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002... | [ECO:0000269\|PubMed:12372837,ECO:0000269\|PubMe... |
| 15 | P17612 | UniProtKB | Modified residue | 202 | 202 | . | . | . | Note=Phosphothreonine;Ontology_term=ECO:000026... | [PMID:17909264] | NaN | [Phosphothreonine] | [ECO:0000269] | [ECO:0000269\|PubMed:17909264] |
| 16 | P17612 | UniProtKB | Modified residue | 331 | 331 | . | . | . | Note=Phosphotyrosine;Ontology_term=ECO:0000250... | NaN | NaN | [Phosphotyrosine] | [ECO:0000250] | [ECO:0000250\|UniProtKB:P05132] |
| 17 | P17612 | UniProtKB | Modified residue | 339 | 339 | . | . | . | Note=Phosphoserine;Ontology_term=ECO:0000244,E... | [PMID:18691976,PMID:19690332,PMID:24275569,PMI... | NaN | [Phosphoserine] | [ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002... | [ECO:0000244\|PubMed:18691976,ECO:0000244\|PubMe... |
|  | index | pdbx_PDB_model_num | auth_asym_id | auth_seq_id | group_PDB | id | type_symbol | label_atom_id | label_alt_id | label_comp_id | ... | CATH_regionResNum | CATH_dbAccessionId | Pfam_regionId | Pfam_regionStart | Pfam_regionEnd | Pfam_regionResNum | Pfam_dbAccessionId | annotation | site | accession |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 286 | 1 | A | 48 | ATOM | 277 | N | N | . | THR | ... | 49 | 3.30.200.20 | 1 | 44.0 | 298.0 | 49 | PF00069 | Domain: ['Protein kinase'] (nan), Modified res... | 49 | P17612 |
| 130 | 39 | 1 | A | 139 | ATOM | 1040 | N | N | . | SER | ... | 140 | 1.10.510.10 | 1 | 44.0 | 298.0 | 140 | PF00069 | Domain: ['Protein kinase'] (nan), Modified res... | 140 | P17612 |
| 189 | 101 | 1 | A | 195 | ATOM | 1517 | N | N | . | THR | ... | 196 | 1.10.510.10 | 1 | 44.0 | 298.0 | 196 | PF00069 | Domain: ['Protein kinase'] (nan), Modified res... | 196 | P17612 |
| 191 | 103 | 1 | A | 197 | HETATM | 1538 | N | N | . | TPO | ... | 198 | 1.10.510.10 | 1 | 44.0 | 298.0 | 198 | PF00069 | Domain: ['Protein kinase'] (nan), Modified res... | 198 | P17612 |
| 195 | 109 | 1 | A | 201 | ATOM | 1567 | N | N | . | THR | ... | 202 | 1.10.510.10 | 1 | 44.0 | 298.0 | 202 | PF00069 | Domain: ['Protein kinase'] (nan), Mutagenesis:... | 202 | P17612 |
| 326 | 251 | 1 | A | 330 | ATOM | 2586 | N | N | . | TYR | ... | 331 | 3.30.200.20 | - | 0.0 | 0.0 | NaN | NaN | Domain: ['AGC-kinase C-terminal'] (nan), Modif... | 331 | P17612 |
| 334 | 259 | 1 | A | 338 | HETATM | 2648 | N | N | . | SEP | ... | 339 | 3.30.200.20 | - | 0.0 | 0.0 | NaN | NaN | Domain: ['AGC-kinase C-terminal'] (nan), Modif... | 339 | P17612 |

7 rows × 91 columns