Computational and Statistical Tools for Research Annotated Publications

Maurice HT Ling edited this page Mar 16, 2017 · 30 revisions

[1] Ling, MHT and So, CW. 2003. Architecture of an Open-Sourced, Extensible Data Warehouse Builder: InterBase 6 Data Warehouse Builder (IB-DWB). In Rubinstein, B. I. P., Chan, N., Kshetrapalapuram, K. K. (Eds.), Proceedings of the First Australian Undergraduate Students' Computing Conference. (pp. 40-45). [Abstract] [PDF]

We report the development of an open-source data warehouse builder, InterBase Data Warehouse Builder (IB-DWB), based on Borland InterBase 6 Open Edition Database Server. InterBase 6 is used for its low maintenance and small footprint. IB-DWB is designed modularly and consists of five main components: a Data Plug Platform, a Discoverer Platform, a Multi-Dimensional Cube Builder, a Query Supporter, and a Kernel that binds them together. It is also an extensible system, made possible by the Data Plug Platform and the Discoverer Platform. Currently, extensions are only possible via dynamically linked libraries (DLLs). The Multi-Dimensional Cube Builder provides a basal means of data aggregation. The architectural philosophy of IB-DWB centres on providing an extensible base platform that is functionally supported by expansion modules. IB-DWB is currently hosted on sourceforge.net (project Unix name: ib-dwb) and licensed under the GNU General Public License, Version 2.

[2] Ling, MHT. 2006. An Anthological Review of Research Utilizing MontyLingua, a Python-Based End-to-End Text Processor. The Python Papers 1 (1): 5-12. [Abstract] [PDF]

[3] Ling, MHT. 2007. Firebird Database Backup by Serialized Database Table Dump. The Python Papers 2 (1): 12-16. [Abstract] [PDF]

This paper presents a simple data dump and load utility for Firebird databases, mimicking mysqldump in MySQL. The utility, fb_dump and fb_load for dumping and loading respectively, retrieves each database table using kinterbasdb and serializes the data using the marshal module. It has two advantages over the standard Firebird database backup utility, gbak. Firstly, it can back up and restore single database tables, which may help to recover corrupted databases. Secondly, the output is in a text-coded format (via the marshal module), making it more resilient than a compressed text backup, as in the case of gbak.
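The per-table dump-and-load idea can be sketched in a few lines, assuming rows arrive as tuples of marshal-serializable values; the function names here are illustrative, and the actual fb_dump/fb_load utilities read rows from Firebird via kinterbasdb:

```python
import marshal

def dump_table(rows, filename):
    """Serialize a list of row tuples to a file with the marshal module.
    A simplified sketch of the per-table dump approach; fb_dump itself
    fetches the rows from a Firebird database first."""
    with open(filename, 'wb') as f:
        marshal.dump([tuple(r) for r in rows], f)

def load_table(filename):
    """Deserialize rows previously written by dump_table."""
    with open(filename, 'rb') as f:
        return marshal.load(f)
```

Because each table is dumped to its own file, a single corrupted table can be restored in isolation, which is the first advantage noted above.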

[4] Ling, MHT, Lefevre, C, Nicholas, KR, Lin, F. 2007. Re-construction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. In J.C. Ragapakse, B. Schmidt, and G. Volkert (Eds.), Proceedings of the Second IAPR Workshop on Pattern Recognition in Bioinformatics (PRIB 2007). Lecture Notes in Bioinformatics 4774. (pp. 286-299) Springer-Verlag. [Abstract] [PDF]

Prior to this project, it was generally assumed in the NLP field that biomedical text is domain-specific and requires a degree of tool adaptation from the generic domain to be of use. Muscorian refuted this assumption by demonstrating that an un-adapted generic text processor can perform comparably to adapted tools. At the same time, the un-adapted text processor forms a generalized layer that transforms unstructured text into a structured table of subject-verb-object triples on which question-specific tools can be built. This study also demonstrated the flexibility of this generalization-specialization paradigm by using the same generalized layer for two specialized questions.

[5] Ling, MHT, Lefevre, C, Nicholas, KR. 2008. Parts-of-Speech Tagger Errors Do Not Necessarily Degrade Accuracy in Extracting Information from Biomedical Text. The Python Papers 3(1): 65-80. [Abstract] [PDF]

This manuscript investigates why an un-adapted text processor can perform comparably to adapted tools. It was found that although an un-adapted text processor's parts-of-speech (POS) tagging accuracy is lower than that of specialized tools, this has minimal effect on the transformation to subject-verb-object structures due to complementary POS tag use in shallow parsing (breaking down sentences into phrases), thus supporting our previous findings.

[6] Ling, MHT, Lefevre, C, Nicholas, KR. 2008. Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research. The Python Papers 3(3): 4. [Abstract] [PDF]

Besides NLP, statistical linguistics, which depends on the co-appearance of words or names in text, has been used to extract potential protein-protein interactions, as in PubGene and CoPub Mapper. In the case of PubGene, it was found that the presence of two protein names in one abstract out of 10 million (1-PubGene) suggests a 60% likelihood of interaction, which increases to 72% when the names appear together five times or more (5-PubGene). This manuscript analyzed the PubGene methods using the Poisson distribution and found that 1-PubGene is generally more stringent than 99% confidence on the Poisson distribution, thus explaining 1-PubGene's expectedly good performance. This study demonstrated that NLP-extracted interactions were almost a proper subset of statistical extractions, suggesting that NLP can be used to annotate statistical extractions. This study also found that the majority of co-expressed genes from microarray analysis, including 7 pairs of perfectly co-expressed genes, were not mentioned together in text, suggesting that these potential interactions had not been studied experimentally. Hence, we suggest that text mining may be used to construct a "state of current knowledge" suitable for identifying potential hypotheses for further experimental research.
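The Poisson reasoning above can be sketched as follows; the function names and the exact significance criterion are illustrative assumptions, not the manuscript's procedure. Under independence, the expected number of abstracts mentioning both names follows a Poisson distribution with rate proportional to the product of the individual mention counts:

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam): one minus the PMF summed over 0..k-1."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i)
                     for i in range(k))

def cooccurrence_significant(n_a, n_b, n_ab, total, alpha=0.01):
    """Flag a name pair whose observed co-mention count n_ab is improbable
    under the independence null Poisson(n_a * n_b / total).
    alpha=0.01 corresponds to the 99% confidence level discussed above."""
    lam = n_a * n_b / total
    return poisson_sf(n_ab, lam) < alpha
```

For two names each appearing in 100 of 10 million abstracts, even a single co-mention is already improbable under the null, illustrating why the 1-PubGene criterion is stringent.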

[7] Ling, MHT. 2009. Compendium of Distributions, I: Beta, Binomial, Chi-Square, F, Gamma, Geometric, Poisson, Student's t, and Uniform. The Python Papers Source Codes 1:4. [Abstract] [PDF] [Zipped Codes]

This paper is the first of a series implementing routines to calculate statistical distributions, which form the basis of other statistical tests.

[8] Ling, MHT. 2009. Ten Z-test Routines from Gopal Kanji's 100 Statistical Tests. The Python Papers Source Codes 1:5. [Abstract] [PDF] [Zipped Codes]

This paper is the first of a series implementing statistical test routines. For this manuscript, I chose to implement the test routines from Gopal Kanji's book that use the Normal distribution, accounting for 10% of the book.

[9] Ling, MHT. 2009. Understanding Mouse Lactogenesis by Transcriptomics and Literature Analysis. Doctor of Philosophy. Department of Zoology, The University of Melbourne, Australia. [Full Text]

This thesis was advised by Professor Kevin R. Nicholas (currently at Deakin University, Australia) and co-advised by Associate Professors Christophe Lefevre (currently at Deakin University, Australia) and Feng Lin (currently at Nanyang Technological University, Singapore). The thesis refuted the previous assumption that a generic computational linguistics processor is unable to process biomedical text due to domain-specificity, and attributed this to complementary parts-of-speech tag use in the shallow parsing (breaking down sentences into phrases) process. It confirmed that the subject-verb-object structure is a suitable intermediate for extracting protein-protein interactions from text and demonstrated the flexibility of this technique in information extraction. It also demonstrated that information extraction by computational linguistics can supplement information extraction by statistical co-occurrence. Using computational and statistical information extraction, a filter representing the current state of biological knowledge was built for use with microarray analysis to identify potential novel hypotheses for further research. Finally, the thesis examined the relevance of mouse hormone-treated mammary tissue culture in studying mouse lactogenesis by comparing the transcriptomes of cultured tissues with in vivo mammary tissues across the lactation cycle using Affymetrix microarrays. It concluded that the tissue culture is useful for studying primary hormonal responses but unlikely to be useful for studying sustained responses, and that it is a useful tool to "re-construct" the set of hormonal stimuli required to stimulate mouse mammary tissues into lactogenesis.

[10] Kuo, CJ, Ling, MHT, Lin, KT, Hsu, CN. 2009. BIOADI: A Machine Learning Approach to Identify Abbreviations and Definitions in Biological Literature. BMC Bioinformatics 10(Suppl 15):S7. [Full Text] [PDF]

This manuscript addresses a limitation identified in my doctoral thesis: real-time identification of gene/protein names and their abbreviations in text, instead of the dictionary approach used in my thesis. We identified about 1.7 million unique long-form/abbreviation pairs in the entire PubMed with 95.86% precision and 89.9% recall, at an average computational speed of 10.2 seconds per thousand abstracts. BIOADI is also a standalone tool that can be incorporated into an analysis pipeline. This study also contributed an annotated corpus to the community for tool evaluation purposes.

[11] Ling, MHT, Lefevre, Christophe, Nicholas, Kevin R. 2009. Biomedical Literature Analysis: Current State and Challenges. In B.G. Kutais (ed). Internet Policies and Issues, Volume 7. Nova Science Publishers, Inc.

This manuscript reviews the central (information retrieval, information extraction and text mining) and allied (corpus collection, databases and system evaluation methods) domains of computational linguistics to present the current state of biomedical literature analysis for protein-protein and protein-gene interactions, and the challenges ahead. Firstly, biomedical text mining is highly dependent on PubMed (MedLine) as a text repository, yet neither its implementation details nor its performance in terms of precision and recall is known. Secondly, extraction of interactions depends on the recognition of entity (protein and gene) names in text, and whether different names refer to the same protein remains an open problem. Thirdly, extraction of interactions by co-occurrence and by NLP has been shown to be complementary, suggesting the improvement of future systems in this direction. Fourthly, evidence suggests that generic NLP engines may be able to process text for interaction extraction due to complementary POS tag use in the shallow parsing process, but more extensive evaluations are needed. Fifthly, there is a shortage of suitable corpora for system evaluation, making comparisons difficult (due to different corpora or databases used in evaluation) and prompting the collection of a common set of corpora for communal use. Lastly, biomedical literature analysis tools must demonstrate real-world applications without a steep learning curve before the slow adoption of these tools by biologists (the intended users) can be reversed.

[12] Lee, CH, Lee, KC, Oon, JSH, Ling, MHT. 2010. Bactome, I: Python in DNA Fingerprinting. In: Peer-Reviewed Articles from PyCon Asia-Pacific 2010. The Python Papers 5(3): 6. [Abstract] [PDF] [Zipped Codes]

Bactome is a set of functions created for our analysis of DNA fingerprints. It includes functions to find suitable primers for PCR-based DNA fingerprinting given a known genome, to determine restriction digestion profiles, and to analyse the resulting DNA fingerprint features, such as the migration distances of bands in gel electrophoresis.

[13] Ng, YY and Ling, MHT. 2010. Electronic Laboratory Notebook on Web2Py Framework. In: Peer-Reviewed Articles from PyCon Asia-Pacific 2010. The Python Papers 5(3): 7. [Abstract] [PDF]

This paper presents CyNote version 1.4, a prototype electronic laboratory notebook built on the Web2py framework. CyNote uses a blog-style structure (entries and comments) as the laboratory notebook and implements a number of bioinformatics and statistical analysis functions. The paper also evaluates CyNote against US FDA 21 CFR Part 11.

[14] Ling, MHT. 2010. COPADS, I: Distances Measures between Two Lists or Sets. The Python Papers Source Codes 2:2. [Abstract] [PDF] [Zipped Codes]

This paper implements 35 distance coefficients with worked examples: Jaccard, Dice, Sokal and Michener, Matching, Anderberg, Ochiai, Ochiai 2, First Kulcsynski, Second Kulcsynski, Forbes, Hamann, Simpson, Russel and Rao, Roger and Tanimoto, Sokal and Sneath, Sokal and Sneath 2, Sokal and Sneath 3, Buser, Fossum, Yule Q, Yule Y, McConnaughey, Stiles, Pearson, Dennis, Gower and Legendre, Tulloss, Hamming, Euclidean, Minkowski, Manhattan, Canberra, Complement Bray and Curtis, Cosine, and Tanimoto.
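Two of the listed coefficients, Jaccard and Dice, can be sketched for sets as follows; these are the textbook definitions, not COPADS's actual function signatures:

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))
```

For {1, 2, 3} and {2, 3, 4}, Jaccard gives 2/4 = 0.5 while Dice gives 4/6, illustrating how the two coefficients weight the shared elements differently.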

[15] Chay, ZE, Ling, MHT. 2010. COPADS, II: Chi-Square test, F-Test and t-Test Routines from Gopal Kanji's 100 Statistical Tests. The Python Papers Source Codes 2:3. [Abstract] [PDF] [Zipped Codes]

This paper extends previous work on the implementation of statistical tests as described by Kanji. A total of 8 Chi-square test, 3 F-test and 6 t-test routines are implemented, bringing the total to 27 out of 100 tests implemented to date.

[16] Lim, JZR, Aw, ZQ, Goh, DJW, How, JA, Low, SXZ, Loo, BZL, Ling, MHT. 2010. A genetic algorithm framework grounded in biology. The Python Papers Source Codes 2: 6. [Abstract] [PDF] [Zipped Codes]

This manuscript describes the implementation of a genetic algorithm (GA) framework that follows the biological hierarchy, from chromosomes to organisms to populations.

[17] Tahat, A, Ling, MHT. 2010. Mapping Relational Operations onto a Hypergraph Model. The Python Papers 6(1): 4. [Abstract] [PDF]

The relational model is the most commonly used data model for storing large datasets. However, many real-world objects are recursive and associative in nature, which makes storage in the relational model difficult. The hypergraph model is a generalization of the graph model, where each hypernode can be made up of other nodes or graphs and each hyperedge can be made up of one or more edges. It may address the recursive and associative limitations of the relational model. However, the hypergraph model is non-tabular and thus loses the simplicity of the relational model. In this study, we consider the means to convert a relational model into a hypergraph model in two layers and present a reference implementation of relational operators (project, rename, select, inner join, natural join, left join, right join, outer join and Cartesian join) on a hypergraph model.

[18] Ling, MHT. 2010. Specifying the Behaviour of Python Programs: Language and Basic Examples. The Python Papers 5(2): 4. [Abstract] [PDF]

This manuscript describes BeSSY, a function-centric language for formal behavioural specification that requires no more than high-school mathematics on arithmetic, functions, Boolean algebra and set theory. An object can be modelled as a union of data sets and functions, whereas an inherited object can be modelled as a union of supersets and a set of object-specific functions. Python list and dictionary operations are specified in BeSSY for illustration.

[19] Ling, MHT, Lefevre, Christophe, Nicholas, KR. 2010. Mining Protein-Protein Interactions from Published Abstracts with MontyLingua. In Zhongming Zhao(ed). Sequence and Genome Analysis: Methods and Applications. iConcept Press Pty Ltd. [PDF]

[20] Ling, MHT. 2011. Bactome II: Analyzing Gene List for Gene Ontology Over-Representation. The Python Papers Source Codes 3: 3. [Abstract] [PDF] [Zipped Codes]

Microarray is an experimental tool that allows for the screening of several thousand genes in a single experiment, and its analysis often requires mapping genes onto biological processes so that over-represented processes can be examined. A number of tools have been developed, but each differs in the organisms that can be analyzed. The Gene Ontology website maintains up-to-date annotation files for different organisms that can be used for over-representation analysis; each file maps each gene of the organism to its ontological terms. Bactome II is a simple tool that allows users to use these up-to-date annotation files to generate the expected and observed counts for each GO identifier (GO ID) from a given gene list for further statistical analyses.
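The expected-versus-observed counting described above can be sketched as follows; the function and parameter names are hypothetical and do not reflect Bactome II's actual interface:

```python
from collections import Counter

def go_counts(gene_list, annotation, background):
    """For each GO ID, return (observed, expected) counts: observed among
    genes in gene_list, and expected if GO terms occurred in the list at
    the same rate as in the background set. 'annotation' maps each gene
    to its list of GO IDs, as parsed from a GO annotation file."""
    observed = Counter(go for g in gene_list for go in annotation.get(g, []))
    bg = Counter(go for g in background for go in annotation.get(g, []))
    frac = len(gene_list) / len(background)  # sampling fraction
    return {go: (observed.get(go, 0), bg[go] * frac) for go in bg}
```

A GO ID whose observed count greatly exceeds its expected count is a candidate for over-representation, to be confirmed by a statistical test such as the hypergeometric test.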

[21] Kuo, CJ, Ling, MHT, Hsu, CN. 2011. Soft Tagging of Overlapping High Confidence Gene Mention Variants for Cross-Species Full-Text Gene Normalization. BMC Bioinformatics 12(Suppl 8):S6. [Full Text]

Background: Previous gene normalization (GN) systems mostly focused on disambiguation using contextual information; an effective gene mention tagger was deemed unnecessary because subsequent steps would filter out false positives, making high recall sufficient. However, unlike similar tasks in past BioCreative challenges, the BioCreative III GN task is particularly challenging because it is not species-specific. Being required to process full-length articles, an ineffective gene mention tagger may produce a huge number of ambiguous false positives that overwhelm subsequent filtering steps while still missing many true positives. Results: We present our GN system, which participated in the BioCreative III GN task. Our system applies a typical 2-stage approach to GN but features a soft-tagging gene mention tagger that generates a set of overlapping gene mention variants with nearly perfect recall. The overlapping gene mention variants increase the chance of a precise match in the dictionary and alleviate the need for disambiguation. Our GN system achieved a precision of 0.9 (F-score 0.63) on the BioCreative III GN test corpus with the silver annotation of 507 articles. Its TAP-k scores are competitive with the best results among all participants. Conclusions: We show that despite the lack of clever disambiguation in our gene normalization system, effective soft tagging of gene mention variants can indeed contribute to performance in cross-species and full-text gene normalization.

[22] Ling, MHT, Jean, A, Liao, D, Tew, BBY, Ho, S, Clancy, K. 2011. Integration of Standardized Cloning Methodologies and Sequence Handling to Support Synthetic Biology Studies. Third International Workshop on Bio-Design Automation (IWBDA). San Diego, California, USA.

[23] Ling, MHT. 2012. An Artificial Life Simulation Library Based on Genetic Algorithm, 3-Character Genetic Code and Biological Hierarchy. The Python Papers 7: 5. [Abstract] [PDF]

Genetic algorithm (GA) is inspired by the biological evolution of genetic organisms, optimizing the genotypic combinations encoded within each individual with the help of evolutionary operators; this suggests that GA may be a suitable model for studying real-life evolutionary processes. This paper describes the design of a Python library for artificial life simulation, Digital Organism Simulation Environment (DOSE), based on GA and a biological hierarchy starting from the genetic sequence up to the population. A 3-character instruction set that takes no operands is introduced as the genetic code for digital organisms, mimicking the 3-nucleotide codon structure in naturally occurring DNA. In addition, a 3-dimensional world composed of ecological cells is introduced to simulate a physical ecosystem. Using DOSE, an experiment examining the changes in genetic sequences with respect to mutation rates is presented.
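The codon-level mutation idea can be illustrated with a short sketch; this is not DOSE's actual implementation, and the names are assumptions:

```python
import random

def mutate(genome, codons, rate, rng=random):
    """Point-mutate a genome of 3-character instructions: each codon is
    replaced by a random codon drawn from the instruction set with
    probability 'rate', mirroring a per-codon mutation rate."""
    return [rng.choice(codons) if rng.random() < rate else c for c in genome]
```

Sweeping `rate` over a range of values and tracking the divergence of genomes from an ancestral sequence is the kind of mutation-rate experiment the paper describes.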

[24] Ling, MHT. 2012. Ragaraja 1.0: The Genome Interpreter of Digital Organism Simulation Environment (DOSE). The Python Papers Source Codes 4: 2. [Abstract] [PDF] [Zipped Codes]

This manuscript describes the implementation and testing of the Ragaraja instruction set version 1.0, the core genomic interpreter of DOSE.

[25] Chen, KFQ, Ling, MHT. 2013. COPADS III (Compendium of Distributions II): Cauchy, Cosine, Exponential, Hypergeometric, Logarithmic, Semicircular, Triangular, and Weibull. The Python Papers Source Codes 5: 2. [Abstract] [PDF] [Zipped Codes]

This manuscript illustrates the implementation and testing of eight statistical distributions, namely the Cauchy, Cosine, Exponential, Hypergeometric, Logarithmic, Semicircular, Triangular, and Weibull distributions, where each distribution provides three common functions: the Probability Density Function (PDF), the Cumulative Distribution Function (CDF) and the inverse of the CDF (inverseCDF). These codes have been incorporated into the COPADS codebase (https://github.com/copads/copads) and are licensed under the Lesser General Public Licence version 3.

[26] Ling, MHT. 2014. NotaLogger: Notarization Code Generator and Logging Service. The Python Papers 9: 2. [Abstract] [PDF] [Web Application]

The act of affixing a signature and date to a document, known as notarization, is often used as evidence of sighting or bearing witness to the document in question. Notarization and dating are required to render documents admissible in a court of law. However, the weakest link in the process of notarization is the notary; that is, the person dating and affixing his/her signature. A number of legal cases have shown instances of false dating and falsification of signatures. In this study, NotaLogger is proposed, which can be used to generate a notarization code to be appended to the document to be notarized. During notarization code generation, the user can include relevant information identifying the document to be notarized, and the date and time of code generation are logged into the system. Generated and used notarization codes can be verified by searching in NotaLogger, and each search results in date-time stamping by a Network Time Protocol server. As a result, NotaLogger can be used as an "independent witness" to any notarization. NotaLogger can be accessed at http://mauricelab.pythonanywhere.com/notalogger/.
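One way to realize such a notarization code, hashing the document description together with a generation timestamp, is sketched below; the code format and function name are illustrative assumptions, as the actual NotaLogger format is not described here:

```python
import datetime
import hashlib

def generate_code(description):
    """Derive a short hexadecimal notarization code from the document
    description and the UTC generation timestamp. The timestamp is
    returned alongside the code so that both can be logged."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    code = hashlib.sha256((description + stamp).encode()).hexdigest()[:16]
    return code, stamp
```

Logging the (code, stamp, description) triple server-side is what lets a later search act as independent evidence that the code existed at that time.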

[27] Chan, OYW, Keng, BMH, Ling, MHT. 2014. Bactome III: OLIgonucleotide Variable Expression Ranker (OLIVER) 1.0, Tool for Identifying Suitable Reference (Invariant) Genes from Large Microarray Datasets. The Python Papers Source Codes 6: 2. [PDF] [Zipped Codes] [Application Download]

This manuscript documents the implementation of the OLIgonucleotide Variable Expression Ranker (OLIVER) as described in Chan et al. (2014), which can be downloaded from http://sourceforge.net/projects/bactome/files/OLIVER/OLIVER_1.zip. These codes are licensed under the GNU General Public License version 3 for academic and not-for-profit use.

[28] Castillo, CFG, Ling, MHT. 2014. Digital Organism Simulation Environment (DOSE): A Library for Ecologically-Based In Silico Experimental Evolution. Advances in Computer Science: an International Journal 3(1): 44-50. [Abstract] [PDF]

Testing evolutionary hypotheses in a biological setting is expensive and time-consuming. Computer simulations of organisms (digital organisms) are commonly used as proxies to study evolutionary processes. A number of digital organism simulators have been developed but are deficient in biological and ecological parallels. In this study, we present DOSE (Digital Organism Simulation Environment), a digital organism simulator with biological and ecological parallels. DOSE consists of a biological hierarchy of genetic sequences, organism, population, and ecosystem. A 3-character instruction set that takes no operands is used as the genetic code for digital organisms, mimicking the 3-nucleotide codon structure in naturally occurring DNA. The evolutionary driver is simulated by a genetic algorithm. We demonstrate DOSE's utility in examining the effects of migration on heterozygosity, also known as local genetic distance. Our simulation results showed that adjacent migration, such as foraging or nomadic behaviour, increases heterozygosity, while long-distance migration, such as flight covering the entire ecosystem, does not.

[29] Koh, YZ, Ling, MHT. 2014. Catalog of Biological and Biomedical Databases Published in 2013. iConcept Journal of Computational and Mathematical Biology 3: 3. [PDF]

[30] Ling, MHT. 2016. COPADS IV: Fixed Time-Step ODE Solvers for a System of Equations Implemented as a Set of Python Functions. Advances in Computer Science: an international journal 5(3): 5-11. [Abstract] [PDF]

Ordinary differential equation (ODE) systems are commonly used in many different fields. The de-facto method of implementing an ODE system in Python using SciPy requires the entire system to be implemented as a single function, which only allows for inline documentation. Although each equation can be broken up into sub-equations, there is no compartmentalization of sub-equations to their ODE. A better method is to implement each ODE as a function. This encapsulates the sub-equations within their ODE and allows for function and inline documentation, resulting in better maintainability. This study presents the implementation of 11 ODE solvers that enable each ODE in a system to be implemented as a function. Three enhancements are added. Firstly, the solvers are implemented as generators to allow for virtually infinite simulation, returning a stream of intermediate results for analysis. Secondly, the solvers allow for non-ODE-bounded variables in the solution vector to improve code and results documentation. Lastly, a means to set upper and lower boundaries on ODE solutions is added. Validation testing shows that the enhanced ODE solvers give results comparable to SciPy's default ODE solver. The implemented solvers are incorporated into the COPADS repository (https://github.com/copads/copads).
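The per-ODE-function, generator-based design can be illustrated with a minimal fixed-step Euler solver; this is a sketch of the design idea, not COPADS's actual API:

```python
def euler(odes, y0, t0, dt):
    """Fixed time-step Euler solver written as a generator. Each ODE in
    the system is its own function f(t, y) -> dy_i/dt, so every equation
    can carry its own docstring. Yields (t, y) at every step, starting
    with the initial condition, for a virtually infinite stream."""
    t, y = t0, list(y0)
    while True:
        yield t, list(y)
        dy = [f(t, y) for f in odes]                    # evaluate each ODE
        y = [yi + dt * di for yi, di in zip(y, dy)]     # Euler update
        t += dt
```

Because the solver is a generator, the caller decides how long to run it and can inspect intermediate results, for example `for t, y in itertools.islice(euler(...), 1000): ...`.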

[31] Chew, JS, Ling, MHT. 2016. TAPPS Release 1: Plugin-Extensible Platform for Technical Analysis and Applied Statistics. Advances in Computer Science: an international journal 5(1): 132-141. [Abstract] [PDF]

In this first article, the main features of TAPPS were described: (1) a thin platform with (2) a CLI-based, domain-specific command language where (3) all analytical functions are implemented as plugins. This results in a defined plugin system, which enables rapid prototyping and testing of analysis functions. This article also describes the architecture and implementation of TAPPS in a level of detail sufficient for interested developers to fork the code for further improvements.

[32] Chay, ZE, Goh, BF, Ling, MHT. 2016. PNet: A Python Library for Petri Net Modeling and Simulation. Advances in Computer Science: an international journal 5(4): 24-30.
