diff --git a/Examples.ipynb b/Examples.ipynb index 273a252..2eb5866 100644 --- a/Examples.ipynb +++ b/Examples.ipynb @@ -17,22 +17,27 @@ "View instructions provided in the main README.md available at https://github.com/bartongroup/ProteoFAV" ] }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import proteofav" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Example Usage" + "## Configuration" ] }, { - "cell_type": "code", - "execution_count": 1, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# ipython magics to keep reloading the project (during testing)\n", - "%load_ext autoreload\n", - "%autoreload 2" + "ProteoFAV implements two approaches to handle datasets. One can fetch a few files on the fly using the convenience functions provided. For large-scale studies, however, it is preferable to use a local source for the data used, such as the mmCIF files for three-dimensional protein structures."
] }, { @@ -59,9 +64,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -218,6 +221,13 @@ "print(mmcif_bio.columns)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a formal description of each column, please see http://mmcif.wwpdb.org/" + ] + }, { "cell_type": "code", "execution_count": 6, @@ -259,7 +269,8 @@ } ], "source": [ - "# PDB Lines are parsed so that column names mimic those of the mmCIF format\n", + "# Column names of a parsed PDB file mimic those of the mmCIF format\n", + "# Please prefer processing mmCIF instead of PDB files, which are deprecated\n", "pdb = PDB.read(filename=out_pdb)\n", "print(pdb.head())\n", "print(pdb.columns)" @@ -269,15 +280,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Dowloading a SIFTS xml record for obtaining PDB-UniProt mapping" + "### Downloading a SIFTS xml record for obtaining PDB-UniProt mapping" ] }, { "cell_type": "code", "execution_count": 7, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from proteofav.sifts import SIFTS\n", @@ -357,6 +366,21 @@ "print(sifts.head())" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The SIFTS record also contains mappings to many other databases, such as:\n", + "- CATH\n", + "- SCOP\n", + "- Pfam\n", + "\n", + "Bear in mind that SIFTS mapping occurs at the residue level, but also at the domain level. \n", + "The default action is to load the residue mapping.\n", + "\n", + "Also see the *PDB_Annotation* column, which flags several types of annotation at the residue level, for example whether a given UniProt residue was observed."
+ ] + }, { "cell_type": "markdown", "metadata": {}, @@ -457,54 +481,47 @@ "name": "stdout", "output_type": "stream", "text": [ - " validation_rama validation_ligRSRnbrMean validation_chain validation_ent \\\n", - "0 NaN NaN A 1 \n", - "1 Favored NaN A 1 \n", - "2 Favored NaN A 1 \n", - "3 Favored NaN A 1 \n", - "4 Favored NaN A 1 \n", - "\n", - " validation_altcode validation_rota validation_icode \\\n", - "0 . t ? \n", - "1 . Cg_exo ? \n", - "2 . t90 ? \n", - "3 . p90 ? \n", - "4 . Cg_endo ? \n", - "\n", - " validation_ligRSRnbrStdev validation_ligRSRnumnbrs validation_resname \\\n", - "0 NaN NaN VAL \n", - "1 NaN NaN PRO \n", - "2 NaN NaN TRP \n", - "3 NaN NaN PHE \n", - "4 NaN NaN PRO \n", - "\n", - " ... validation_lig_rsrz_nbr_id validation_ligRSRZ \\\n", - "0 ... NaN NaN \n", - "1 ... NaN NaN \n", - "2 ... NaN NaN \n", - "3 ... NaN NaN \n", - "4 ... NaN NaN \n", - "\n", - " validation_flippable-sidechain validation_model validation_said \\\n", - "0 NaN 1 A \n", - "1 NaN 1 A \n", - "2 NaN 1 A \n", - "3 NaN 1 A \n", - "4 NaN 1 A \n", - "\n", - " validation_rsrz validation_psi validation_mogul-ignore validation_owab \\\n", - "0 -0.160 NaN NaN 52.97 \n", - "1 -0.274 149.9 NaN 28.84 \n", - "2 -0.874 139.2 NaN 33.47 \n", - "3 -0.308 150.9 NaN 39.98 \n", - "4 -0.204 130.9 NaN 26.14 \n", - "\n", - " validation_rscc \n", - "0 0.896 \n", - "1 0.960 \n", - "2 0.961 \n", - "3 0.920 \n", - "4 0.973 \n", + " validation_rscc validation_rama validation_icode validation_ligRSRZ \\\n", + "0 0.896 NaN ? NaN \n", + "1 0.960 Favored ? NaN \n", + "2 0.961 Favored ? NaN \n", + "3 0.920 Favored ? NaN \n", + "4 0.973 Favored ? NaN \n", + "\n", + " validation_ligRSRnbrMean validation_flippable-sidechain validation_psi \\\n", + "0 NaN NaN NaN \n", + "1 NaN NaN 149.9 \n", + "2 NaN NaN 139.2 \n", + "3 NaN NaN 150.9 \n", + "4 NaN NaN 130.9 \n", + "\n", + " validation_rsr validation_owab validation_ligRSRnumnbrs ... \\\n", + "0 0.233 52.97 NaN ... \n", + "1 0.190 28.84 NaN ... 
\n", + "2 0.154 33.47 NaN ... \n", + "3 0.229 39.98 NaN ... \n", + "4 0.197 26.14 NaN ... \n", + "\n", + " validation_chain validation_phi validation_said validation_rsrz \\\n", + "0 A NaN A -0.160 \n", + "1 A -51.6 A -0.274 \n", + "2 A -81.0 A -0.874 \n", + "3 A -138.9 A -0.308 \n", + "4 A -65.3 A -0.204 \n", + "\n", + " validation_seq validation_ligRSRnbrStdev validation_altcode \\\n", + "0 1 NaN . \n", + "1 2 NaN . \n", + "2 3 NaN . \n", + "3 4 NaN . \n", + "4 5 NaN . \n", + "\n", + " validation_lig_rsrz_nbr_id validation_NatomsEDS validation_resnum \n", + "0 NaN 7 118 \n", + "1 NaN 7 119 \n", + "2 NaN 14 120 \n", + "3 NaN 11 121 \n", + "4 NaN 7 122 \n", "\n", "[5 rows x 27 columns]\n" ] @@ -515,6 +532,13 @@ "print(validation.head())" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "PDB validation record is convenient when filtering a protein structure for analysis." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -522,6 +546,13 @@ "### Select only CA residues in for a single chain" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Protein structure representation is a hierarchical data structure (See http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ). So to obtain the data in tabular format, ProteoFAV transforms the data. For example, for use cases that require one residue per row, the residue three-dimensional coordinates can be represented by the residue's Cα. 
Other filtering options are available through *filter_structures*." + ] }, { "cell_type": "code", "execution_count": 13, @@ -576,7 +607,7 @@ "mmcif_sel = filter_structures(mmcif, excluded_cols=None,\n", " models='first', chains='A', res=None, res_full=None,\n", " comps=None, atoms='CA', lines=None, category='auth',\n", - " residue_agg=False, agg_method='centroid',\n", + " residue_agg=False, \n", " add_res_full=True, add_atom_altloc=False, reset_atom_id=True,\n", " remove_altloc=False, remove_hydrogens=True, remove_partial_res=False)\n", "print(mmcif_sel.head())" @@ -586,7 +617,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Aggregating atoms residue-by-residue" + "### Aggregating atoms residue-by-residue\n", + "The three-dimensional coordinates of all atoms in a residue can be represented by the residue's centroid." ] }, { @@ -646,9 +678,7 @@ }, { "cell_type": "markdown", - "metadata": { - "collapsed": true - }, + "metadata": {}, "source": [ "### Write a PDB-formatted file from a mmCIF structure" ] @@ -740,7 +770,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Downloading a sequence Annotation (GFF) from UniProt" + "### Downloading a sequence Annotation (GFF) from UniProt\n", + "UniProt provides extensive, high-quality annotation for residues in proteins." ] }, { @@ -762,7 +793,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Loading the sequence Annotation" + "### Loading the sequence Annotation\n", + "Note also that GFF files, although tabular, contain an extra level of nesting in the `GROUP` column. 
ProteoFAV tries to deconvolute this information" ] }, { @@ -813,25 +845,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Downloading variants based on the UniProt ID" + "### Downloading variants based on the UniProt ID\n", + "We could fetch genetic variants from UniProt and Ensembl with:\n", + "\n", + "```python\n", + "Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n", + " synonymous=False, uniprot_vars=True,\n", + " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n", + "```\n", + "\n", + "but `select_variants` handles merging of Ensembl vars for us" ] }, { "cell_type": "code", "execution_count": 19, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from proteofav.variants import Variants\n", "\n", - "# we could simply fetch the variants from UniProt and Ensembl\n", - "# Variants.fetch(identifier=uniprot_ids[0], id_source='uniprot', \n", - "# synonymous=False, uniprot_vars=True,\n", - "# ensembl_germline_vars=True, ensembl_somatic_vars=True)\n", - "\n", - "# but `select_variants` handles merging of Ensembl vars for us\n", "uniprot, ensembl = Variants.select(identifier=uniprot_ids[0], id_source='uniprot', \n", " synonymous=False, uniprot_vars=True,\n", " ensembl_germline_vars=True, ensembl_somatic_vars=True)\n" @@ -854,79 +887,79 @@ "output_type": "stream", "text": [ " accession alternativeSequence association_description association_disease \\\n", - "0 P00439 C mild True \n", - "1 P00439 N NaN True \n", - "2 P00439 del NaN NaN \n", - "3 P00439 F NaN True \n", - "4 P00439 L haplotypes 1,4 True \n", + "0 P00439 A NaN True \n", + "1 P00439 L haplotypes 1,4 True \n", + "2 P00439 L NaN True \n", + "3 P00439 S haplotype 36 True \n", + "4 P00439 V NaN True \n", "\n", " association_evidences_code \\\n", "0 ECO:0000269 \n", "1 ECO:0000269 \n", "2 NaN \n", - "3 NaN \n", + "3 ECO:0000269 \n", "4 ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl \\\n", - "0 
http://europepmc.org/abstract/MED/9048935 \n", - "1 [http://europepmc.org/abstract/MED/12501224, h... \n", + "0 [http://europepmc.org/abstract/MED/22513348, h... \n", + "1 [http://europepmc.org/abstract/MED/22513348, h... \n", "2 NaN \n", - "3 NaN \n", - "4 [http://europepmc.org/abstract/MED/12501224, h... \n", + "3 http://europepmc.org/abstract/MED/2014802 \n", + "4 [http://europepmc.org/abstract/MED/8889590, ht... \n", "\n", " association_evidences_source_id \\\n", - "0 9048935 \n", - "1 [12501224, 22513348, 1358789] \n", + "0 [22513348, 8889590, 8088845, 12501224] \n", + "1 [22513348, 1672290, 8889590, 12501224, 1672294] \n", "2 NaN \n", - "3 NaN \n", - "4 [12501224, 1672290, 8889590, 22513348, 1672294] \n", + "3 2014802 \n", + "4 [8889590, 12501224, 22513348, 23792259] \n", "\n", " association_evidences_source_name \\\n", "0 PubMed \n", "1 PubMed \n", "2 NaN \n", - "3 NaN \n", + "3 PubMed \n", "4 PubMed \n", "\n", " association_evidences_source_url \\\n", - "0 http://www.ncbi.nlm.nih.gov/pubmed/9048935 \n", - "1 [http://www.ncbi.nlm.nih.gov/pubmed/12501224, ... \n", + "0 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", + "1 [http://www.ncbi.nlm.nih.gov/pubmed/22513348, ... \n", "2 NaN \n", - "3 NaN \n", - "4 [http://www.ncbi.nlm.nih.gov/pubmed/12501224, ... \n", + "3 http://www.ncbi.nlm.nih.gov/pubmed/2014802 \n", + "4 [http://www.ncbi.nlm.nih.gov/pubmed/8889590, h... \n", "\n", " association_name \\\n", - "0 Phenylketonuria (PKU) \n", - "1 [Phenylketonuria (PKU), Hyperphenylalaninemia ... \n", - "2 NaN \n", + "0 [Phenylketonuria (PKU), Hyperphenylalaninemia ... \n", + "1 Phenylketonuria (PKU) \n", + "2 Hyperphenylalaninemia (HPA) \n", "3 Phenylketonuria (PKU) \n", "4 Phenylketonuria (PKU) \n", "\n", " ... siftPrediction siftScore \\\n", - "0 ... deleterious 0 \n", - "1 ... tolerated 1 \n", + "0 ... tolerated 0.11 \n", + "1 ... deleterious 0 \n", "2 ... NaN NaN \n", "3 ... NaN NaN \n", - "4 ... 
deleterious 0 \n", - "\n", - " somaticStatus sourceType taxid type wildType xrefs_id \\\n", - "0 0 uniprot 9606 VARIANT Y rs62514927 \n", - "1 0 uniprot 9606 VARIANT D rs62644499 \n", - "2 0 uniprot 9606 VARIANT d NaN \n", - "3 0 uniprot 9606 VARIANT S rs62508577 \n", - "4 0 uniprot 9606 VARIANT P rs5030851 \n", - "\n", - " xrefs_name \\\n", - "0 [dbSNP, Ensembl, ExAC] \n", - "1 [dbSNP, Ensembl, ExAC] \n", - "2 NaN \n", - "3 [dbSNP, Ensembl] \n", - "4 [dbSNP, Ensembl, ESP, ExAC] \n", + "4 ... tolerated 0.06 \n", + "\n", + " somaticStatus sourceType taxid type wildType xrefs_id \\\n", + "0 0 mixed 9606 VARIANT V rs796052017 \n", + "1 0 mixed 9606 VARIANT P rs5030851 \n", + "2 0 uniprot 9606 VARIANT Q rs199475662 \n", + "3 0 uniprot 9606 VARIANT L rs62642930 \n", + "4 0 mixed 9606 VARIANT A rs5030857 \n", + "\n", + " xrefs_name \\\n", + "0 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", + "1 [dbSNP, Ensembl, ESP, ExAC] \n", + "2 [dbSNP, Ensembl] \n", + "3 [dbSNP, Ensembl] \n", + "4 [dbSNP, Ensembl, 1000Genomes, ESP, ExAC] \n", "\n", " xrefs_url \n", "0 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "1 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", - "2 NaN \n", + "2 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "3 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... \n", "4 [http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?t... 
\n", "\n", @@ -948,18 +981,18 @@ "output_type": "stream", "text": [ " Parent allele begin clinical_significance codons \\\n", - "0 ENST00000553106 HGMD_MUTATION 257 [] \n", + "0 ENST00000553106 HGMD_MUTATION 377 [] \n", "1 ENST00000553106 C/T 75 [] Gat/Aat \n", - "2 ENST00000553106 HGMD_MUTATION 47 [] \n", - "3 ENST00000553106 HGMD_MUTATION 100 [] \n", - "4 ENST00000553106 HGMD_MUTATION 261 [] \n", + "2 ENST00000553106 HGMD_MUTATION 300 [] \n", + "3 ENST00000553106 HGMD_MUTATION 245 [] \n", + "4 ENST00000553106 HGMD_MUTATION 415 [] \n", "\n", " consequenceType end feature_type frequency \\\n", - "0 coding_sequence_variant 257 transcript_variation NaN \n", + "0 coding_sequence_variant 377 transcript_variation NaN \n", "1 missense_variant 75 transcript_variation NaN \n", - "2 coding_sequence_variant 47 transcript_variation NaN \n", - "3 coding_sequence_variant 100 transcript_variation NaN \n", - "4 coding_sequence_variant 261 transcript_variation NaN \n", + "2 coding_sequence_variant 300 transcript_variation NaN \n", + "3 coding_sequence_variant 245 transcript_variation NaN \n", + "4 coding_sequence_variant 415 transcript_variation NaN \n", "\n", " polyphenScore residues seq_region_name siftScore translation \\\n", "0 NaN ENSP00000448059 NaN ENSP00000448059 \n", @@ -969,11 +1002,11 @@ "4 NaN ENSP00000448059 NaN ENSP00000448059 \n", "\n", " xrefs_id \n", - "0 CM010966 \n", + "0 CD011183 \n", "1 rs767453024 \n", - "2 CM941126 \n", - "3 CM992944 \n", - "4 CM910287 \n" + "2 CM950893 \n", + "3 CM941133 \n", + "4 CM920564 \n" ] } ], @@ -985,7 +1018,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Merging down the two Variants tables" + "### Merging down the two Variants tables\n", + "For merging variants from the UniProt and Ensembl" ] }, { @@ -997,30 +1031,30 @@ "name": "stdout", "output_type": "stream", "text": [ - " Parent accession allele alternativeSequence \\\n", - "0 NaN P00439 NaN del \n", - "1 NaN P00439 NaN P \n", - "2 NaN P00439 NaN Y \n", - "3 
ENST00000553106 P00439 A/C/G * \n", - "4 NaN P00439 NaN L \n", - "\n", - " association_description association_disease association_evidences_code \\\n", - "0 NaN NaN NaN \n", - "1 NaN True ECO:0000269 \n", - "2 NaN NaN NaN \n", - "3 NaN NaN NaN \n", - "4 NaN True ECO:0000269 \n", + " Parent accession allele alternativeSequence association_description \\\n", + "0 NaN P00439 NaN del NaN \n", + "1 NaN P00439 NaN del NaN \n", + "2 NaN P00439 NaN K NaN \n", + "3 NaN P00439 NaN del NaN \n", + "4 NaN P00439 NaN L NaN \n", + "\n", + " association_disease association_evidences_code \\\n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 NaN NaN \n", + "3 NaN NaN \n", + "4 True ECO:0000269 \n", "\n", " association_evidences_source_alternativeUrl association_evidences_source_id \\\n", "0 NaN NaN \n", - "1 http://europepmc.org/abstract/MED/22513348 22513348 \n", + "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 http://europepmc.org/abstract/MED/23792259 23792259 \n", "\n", " association_evidences_source_name \\\n", "0 NaN \n", - "1 PubMed \n", + "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 PubMed \n", @@ -1028,23 +1062,23 @@ " ... siftScore somaticStatus \\\n", "0 ... NaN 0.0 \n", "1 ... NaN 0.0 \n", - "2 ... 0.01 0.0 \n", + "2 ... 0 0.0 \n", "3 ... NaN 0.0 \n", "4 ... 
NaN 0.0 \n", "\n", - " sourceType taxid translation type wildType xrefs_id \\\n", - "0 uniprot 9606.0 NaN VARIANT d NaN \n", - "1 uniprot 9606.0 NaN VARIANT L NaN \n", - "2 large_scale_study 9606.0 NaN VARIANT F COSM1510886 \n", - "3 large_scale_study 9606.0 ENSP00000448059 VARIANT Y rs62507332 \n", - "4 uniprot 9606.0 NaN VARIANT F NaN \n", + " sourceType taxid translation type wildType xrefs_id \\\n", + "0 uniprot 9606.0 NaN VARIANT L NaN \n", + "1 uniprot 9606.0 NaN VARIANT Y NaN \n", + "2 large_scale_study 9606.0 NaN VARIANT T COSM546084 \n", + "3 uniprot 9606.0 NaN VARIANT L NaN \n", + "4 uniprot 9606.0 NaN VARIANT F NaN \n", "\n", - " xrefs_name xrefs_url \n", - "0 NaN NaN \n", - "1 NaN NaN \n", - "2 cosmic curated http://cancer.sanger.ac.uk/cosmic/mutation/ove... \n", - "3 [1000Genomes, ExAC] [http://www.ensembl.org/Homo_sapiens/Variation... \n", - "4 NaN NaN \n", + " xrefs_name xrefs_url \n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 cosmic curated http://cancer.sanger.ac.uk/cosmic/mutation/ove... \n", + "3 NaN NaN \n", + "4 NaN NaN \n", "\n", "[5 rows x 50 columns]\n" ] @@ -1103,7 +1137,7 @@ "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", - "1 P rs398123292 (ExAC, 1000Genomes) \n", + "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", @@ -1111,7 +1145,7 @@ " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", - "2 (http://exac.broadinstitute.org/awesome?query=... \n", + "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... 
\n", "4 NaN \n", "\n", @@ -1209,7 +1243,7 @@ "\n", " wildType xrefs_id xrefs_name \\\n", "0 V rs776442422 ExAC \n", - "1 P rs398123292 (ExAC, 1000Genomes) \n", + "1 P rs398123292 (1000Genomes, ExAC) \n", "2 P rs374999809 (ExAC, ESP) \n", "3 W rs775327122 ExAC \n", "4 F NaN NaN \n", @@ -1217,7 +1251,7 @@ " xrefs_url \n", "0 http://exac.broadinstitute.org/awesome?query=r... \n", "1 (http://www.ensembl.org/Homo_sapiens/Variation... \n", - "2 (http://exac.broadinstitute.org/awesome?query=... \n", + "2 (http://evs.gs.washington.edu/EVS/PopStatsServ... \n", "3 http://exac.broadinstitute.org/awesome?query=r... \n", "4 NaN \n", "\n", @@ -1234,13 +1268,814 @@ " residue_agg='centroid', overwrite=False)\n", "print(table.head())" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use case 1: characterising the structural properties of post-translationally modified protein sites (or any other site)\n", + "\n", + "One can use ProteoFAV for high-throughput structural characterization of binding sites, as in Britto-Borges and Barton, 2017.\n", + "\n", + "For example, the cAMP-dependent protein kinase catalytic subunit alpha (PKAα) is a small protein kinase that is critical to homeostatic processes in human tissue and to the stress response in lower organisms ([UniProt:P17612](http://www.uniprot.org/uniprot/P17612)). Accordingly, the function of the protein has been extensively studied, and three-dimensional structures with high sequence coverage and resolution are available.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "uniprot_id = 'P17612'\n", + "gff_path = os.path.join(out_dir, uniprot_id + \".gff\")\n", + "\n", + "Annotation.download(\n", + " identifier=uniprot_id, \n", + " filename=gff_path)\n", + "P17612_annotation = Annotation.read(filename=gff_path)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NAMESOURCETYPESTARTENDSCORESTRANDFRAMEGROUPDbxrefIDNoteOntology_termevidence
10P17612UniProtKBModified residue1111...Note=Phosphoserine%3B by autocatalysis;Ontolog...NaNNaN[Phosphoserine; by autocatalysis][ECO:0000250][ECO:0000250|UniProtKB:P05132]
11P17612UniProtKBModified residue4949...Note=Phosphothreonine;Ontology_term=ECO:000024...[PMID:18691976]NaN[Phosphothreonine][ECO:0000244][ECO:0000244|PubMed:18691976]
12P17612UniProtKBModified residue140140...Note=Phosphoserine;Ontology_term=ECO:0000250;e...NaNNaN[Phosphoserine][ECO:0000250][ECO:0000250|UniProtKB:P05132]
13P17612UniProtKBModified residue196196...Note=Phosphothreonine;Ontology_term=ECO:000026...[PMID:12372837]NaN[Phosphothreonine][ECO:0000269][ECO:0000269|PubMed:12372837]
14P17612UniProtKBModified residue198198...Note=Phosphothreonine%3B by PDPK1;Ontology_ter...[PMID:12372837,PMID:16765046,PMID:20137943,PMI...NaN[Phosphothreonine; by PDPK1][ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002...[ECO:0000269|PubMed:12372837,ECO:0000269|PubMe...
15P17612UniProtKBModified residue202202...Note=Phosphothreonine;Ontology_term=ECO:000026...[PMID:17909264]NaN[Phosphothreonine][ECO:0000269][ECO:0000269|PubMed:17909264]
16P17612UniProtKBModified residue331331...Note=Phosphotyrosine;Ontology_term=ECO:0000250...NaNNaN[Phosphotyrosine][ECO:0000250][ECO:0000250|UniProtKB:P05132]
17P17612UniProtKBModified residue339339...Note=Phosphoserine;Ontology_term=ECO:0000244,E...[PMID:18691976,PMID:19690332,PMID:24275569,PMI...NaN[Phosphoserine][ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002...[ECO:0000244|PubMed:18691976,ECO:0000244|PubMe...
\n", + "
" + ], + "text/plain": [ + " NAME SOURCE TYPE START END SCORE STRAND FRAME \\\n", + "10 P17612 UniProtKB Modified residue 11 11 . . . \n", + "11 P17612 UniProtKB Modified residue 49 49 . . . \n", + "12 P17612 UniProtKB Modified residue 140 140 . . . \n", + "13 P17612 UniProtKB Modified residue 196 196 . . . \n", + "14 P17612 UniProtKB Modified residue 198 198 . . . \n", + "15 P17612 UniProtKB Modified residue 202 202 . . . \n", + "16 P17612 UniProtKB Modified residue 331 331 . . . \n", + "17 P17612 UniProtKB Modified residue 339 339 . . . \n", + "\n", + " GROUP \\\n", + "10 Note=Phosphoserine%3B by autocatalysis;Ontolog... \n", + "11 Note=Phosphothreonine;Ontology_term=ECO:000024... \n", + "12 Note=Phosphoserine;Ontology_term=ECO:0000250;e... \n", + "13 Note=Phosphothreonine;Ontology_term=ECO:000026... \n", + "14 Note=Phosphothreonine%3B by PDPK1;Ontology_ter... \n", + "15 Note=Phosphothreonine;Ontology_term=ECO:000026... \n", + "16 Note=Phosphotyrosine;Ontology_term=ECO:0000250... \n", + "17 Note=Phosphoserine;Ontology_term=ECO:0000244,E... \n", + "\n", + " Dbxref ID \\\n", + "10 NaN NaN \n", + "11 [PMID:18691976] NaN \n", + "12 NaN NaN \n", + "13 [PMID:12372837] NaN \n", + "14 [PMID:12372837,PMID:16765046,PMID:20137943,PMI... NaN \n", + "15 [PMID:17909264] NaN \n", + "16 NaN NaN \n", + "17 [PMID:18691976,PMID:19690332,PMID:24275569,PMI... NaN \n", + "\n", + " Note \\\n", + "10 [Phosphoserine; by autocatalysis] \n", + "11 [Phosphothreonine] \n", + "12 [Phosphoserine] \n", + "13 [Phosphothreonine] \n", + "14 [Phosphothreonine; by PDPK1] \n", + "15 [Phosphothreonine] \n", + "16 [Phosphotyrosine] \n", + "17 [Phosphoserine] \n", + "\n", + " Ontology_term \\\n", + "10 [ECO:0000250] \n", + "11 [ECO:0000244] \n", + "12 [ECO:0000250] \n", + "13 [ECO:0000269] \n", + "14 [ECO:0000269,ECO:0000269,ECO:0000269,ECO:00002... \n", + "15 [ECO:0000269] \n", + "16 [ECO:0000250] \n", + "17 [ECO:0000244,ECO:0000244,ECO:0000244,ECO:00002... 
\n", + "\n", + " evidence \n", + "10 [ECO:0000250|UniProtKB:P05132] \n", + "11 [ECO:0000244|PubMed:18691976] \n", + "12 [ECO:0000250|UniProtKB:P05132] \n", + "13 [ECO:0000269|PubMed:12372837] \n", + "14 [ECO:0000269|PubMed:12372837,ECO:0000269|PubMe... \n", + "15 [ECO:0000269|PubMed:17909264] \n", + "16 [ECO:0000250|UniProtKB:P05132] \n", + "17 [ECO:0000244|PubMed:18691976,ECO:0000244|PubMe... " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# phosphorylated sites in UniProt\n", + "P17612_annotation[P17612_annotation.GROUP.str.contains('Note=Phospho')]" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "phospho_residues = P17612_annotation.loc[P17612_annotation.GROUP.str.contains('Note=Phospho'), 'START']" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "from proteofav.sifts import sifts_best" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "P17612_best_structure = sifts_best('P17612')['P17612'][0]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "P17612_best_structure['experimental_method'] == 'X-ray diffraction'" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "P17612_best_structure['tax_id'] == 9606 # human" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "table = Tables.generate(\n", + " merge_tables=True, \n", + " uniprot_id='P17612', \n", + " 
bio_unit=False,\n", + " sifts=True,\n", + " validation=True, \n", + " annotations=True, \n", + " residue_agg='centroid', \n", + " overwrite=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "# every residue in the structure not mapped to the UniProt is discarded\n", + "table.dropna(subset=['UniProt_dbResNum'], axis=0, inplace=True) " + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "table['UniProt_dbResNum'] = table['UniProt_dbResNum'].astype(int)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
indexpdbx_PDB_model_numauth_asym_idauth_seq_idgroup_PDBidtype_symbollabel_atom_idlabel_alt_idlabel_comp_id...CATH_regionResNumCATH_dbAccessionIdPfam_regionIdPfam_regionStartPfam_regionEndPfam_regionResNumPfam_dbAccessionIdannotationsiteaccession
352861A48ATOM277NN.THR...493.30.200.20144.0298.049PF00069Domain: ['Protein kinase'] (nan), Modified res...49P17612
130391A139ATOM1040NN.SER...1401.10.510.10144.0298.0140PF00069Domain: ['Protein kinase'] (nan), Modified res...140P17612
1891011A195ATOM1517NN.THR...1961.10.510.10144.0298.0196PF00069Domain: ['Protein kinase'] (nan), Modified res...196P17612
1911031A197HETATM1538NN.TPO...1981.10.510.10144.0298.0198PF00069Domain: ['Protein kinase'] (nan), Modified res...198P17612
1951091A201ATOM1567NN.THR...2021.10.510.10144.0298.0202PF00069Domain: ['Protein kinase'] (nan), Mutagenesis:...202P17612
3262511A330ATOM2586NN.TYR...3313.30.200.20-0.00.0NaNNaNDomain: ['AGC-kinase C-terminal'] (nan), Modif...331P17612
3342591A338HETATM2648NN.SEP...3393.30.200.20-0.00.0NaNNaNDomain: ['AGC-kinase C-terminal'] (nan), Modif...339P17612
\n", + "

7 rows × 91 columns

\n", + "
" + ], + "text/plain": [ + " index pdbx_PDB_model_num auth_asym_id auth_seq_id group_PDB id \\\n", + "35 286 1 A 48 ATOM 277 \n", + "130 39 1 A 139 ATOM 1040 \n", + "189 101 1 A 195 ATOM 1517 \n", + "191 103 1 A 197 HETATM 1538 \n", + "195 109 1 A 201 ATOM 1567 \n", + "326 251 1 A 330 ATOM 2586 \n", + "334 259 1 A 338 HETATM 2648 \n", + "\n", + " type_symbol label_atom_id label_alt_id label_comp_id ... \\\n", + "35 N N . THR ... \n", + "130 N N . SER ... \n", + "189 N N . THR ... \n", + "191 N N . TPO ... \n", + "195 N N . THR ... \n", + "326 N N . TYR ... \n", + "334 N N . SEP ... \n", + "\n", + " CATH_regionResNum CATH_dbAccessionId Pfam_regionId Pfam_regionStart \\\n", + "35 49 3.30.200.20 1 44.0 \n", + "130 140 1.10.510.10 1 44.0 \n", + "189 196 1.10.510.10 1 44.0 \n", + "191 198 1.10.510.10 1 44.0 \n", + "195 202 1.10.510.10 1 44.0 \n", + "326 331 3.30.200.20 - 0.0 \n", + "334 339 3.30.200.20 - 0.0 \n", + "\n", + " Pfam_regionEnd Pfam_regionResNum Pfam_dbAccessionId \\\n", + "35 298.0 49 PF00069 \n", + "130 298.0 140 PF00069 \n", + "189 298.0 196 PF00069 \n", + "191 298.0 198 PF00069 \n", + "195 298.0 202 PF00069 \n", + "326 0.0 NaN NaN \n", + "334 0.0 NaN NaN \n", + "\n", + " annotation site accession \n", + "35 Domain: ['Protein kinase'] (nan), Modified res... 49 P17612 \n", + "130 Domain: ['Protein kinase'] (nan), Modified res... 140 P17612 \n", + "189 Domain: ['Protein kinase'] (nan), Modified res... 196 P17612 \n", + "191 Domain: ['Protein kinase'] (nan), Modified res... 198 P17612 \n", + "195 Domain: ['Protein kinase'] (nan), Mutagenesis:... 202 P17612 \n", + "326 Domain: ['AGC-kinase C-terminal'] (nan), Modif... 331 P17612 \n", + "334 Domain: ['AGC-kinase C-terminal'] (nan), Modif... 
339 P17612 \n", + "\n", + "[7 rows x 91 columns]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table[table['UniProt_dbResNum'].isin(phospho_residues)] " + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "phospho_residues_b = table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'B_iso_or_equiv'].mean()\n", + "all_residues_b = table.loc[:, 'B_iso_or_equiv'].mean()\n", + "\n", + "phospho_residues_b > all_residues_b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Overall, phosphorylated Ser/Thr residues tend to have high B-factors (hot residues), but that is not true for the `3ovv` structure." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "T 4\n", + "H 2\n", + "E 1\n", + "Name: PDB_codeSecondaryStructure, dtype: int64" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_codeSecondaryStructure'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "4 of 7 residues occur in turns." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Observed'" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'PDB_Annotation'].all()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All of these residues were observed in the structure, that is, none is flagged as missing in the REMARK 465 field." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + 
"outputs": [ + { + "data": { + "text/plain": [ + "35 Favored\n", + "130 Favored\n", + "189 Favored\n", + "191 NaN\n", + "195 Favored\n", + "326 Favored\n", + "334 NaN\n", + "Name: validation_rama, dtype: object" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.loc[table['UniProt_dbResNum'].isin(phospho_residues), 'validation_rama']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "5 out 7 have are not Ramachandran outliers, the NaN values were given for the Phopho resides observed in the protein crystal" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use case 2: Spatial clustering of genetic variants" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (proteofav)", "language": "python", - "name": "python3" + "name": "proteofav" }, "language_info": { "codemirror_mode": { @@ -1252,7 +2087,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.3" + "version": "3.6.4" } }, "nbformat": 4,