List of Software
- Variant Interpretation for Cancer (VIC)
- SeqHBase: a big data toolset for family-based sequencing data analysis
- SparkText: an efficient toolset for data mining large-scale scientific literature
- HadoopCNV
- Association Tests for Annotated Variants (ATAV)
- A User-Friendly Software Tool for Population Stratification Adjustment in Genome-Wide Association Studies
- Statistical Analysis of Antigen Receptor Spectratype Data
- OmicShare
- Variant Interpretation for Cancer (VIC)
[top]
VIC is a computational tool for assessing the clinical impact of somatic variants following the AMP-ASCO-CAP 2017 guidelines. We developed VIC to accelerate the interpretation process and minimize individual biases. VIC takes pre-annotated files and automatically classifies sequence variants based on several criteria, and users can integrate additional evidence to optimize the interpretation of clinical impact. We evaluated VIC using several publicly available databases and compared it with several predictive software programs. We found that VIC is time-efficient and conservative in classifying somatic variants under default settings, especially for variants with strong and/or potential clinical significance. Additionally, we tested VIC on two cancer-panel sequencing datasets to show its effectiveness in facilitating manual interpretation of somatic variants. VIC can also be customized by clinical laboratories to fit their analytical pipelines and ease the laborious process of somatic variant interpretation. The accompanying paper was published in Genome Medicine, 2019 Aug 23;11(1):53. VIC is freely available at https://github.com/HGLab/VIC/.
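To convey the flavor of guideline-based tiering, here is a minimal sketch of a rule-based classifier in the spirit of the AMP-ASCO-CAP 2017 tiers. The evidence names and thresholds are hypothetical simplifications for illustration, not VIC's actual criteria or weights.

```python
# Minimal sketch of rule-based somatic variant tiering in the spirit of the
# AMP-ASCO-CAP 2017 guidelines. Evidence names and thresholds below are
# hypothetical simplifications, not VIC's actual criteria weights.

def classify_variant(evidence):
    """Map a dict of evidence flags to an AMP-ASCO-CAP-style tier."""
    therapeutic = evidence.get("fda_approved_therapy", False)
    guideline = evidence.get("in_professional_guideline", False)
    well_powered = evidence.get("well_powered_studies", False)
    population_af = evidence.get("population_allele_frequency", 0.0)

    # Common in the general population -> likely benign (Tier IV).
    if population_af > 0.01:
        return "Tier IV (benign/likely benign)"
    # Strong clinical significance: approved therapy or guideline inclusion.
    if therapeutic or guideline:
        return "Tier I (strong clinical significance)"
    # Potential clinical significance: supportive study-level evidence.
    if well_powered:
        return "Tier II (potential clinical significance)"
    # Everything else defaults to unknown significance.
    return "Tier III (unknown clinical significance)"

print(classify_variant({"fda_approved_therapy": True}))        # Tier I
print(classify_variant({"population_allele_frequency": 0.05})) # Tier IV
```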
- SeqHBase: a big data toolset for family-based sequencing data analysis
[top]
SeqHBase is a big data toolset built on the Apache Hadoop and HBase infrastructure. It is designed for analyzing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations. SeqHBase takes as input BAM files (for coverage at each of the ~3 billion sites of a genome), VCF files (for variant calls), and functional annotations (for variant prioritization). SeqHBase works in a distributed and fully parallel manner across multiple data nodes. We applied SeqHBase to a 5-member nuclear family and a 10-member three-generation family with whole-genome sequencing (WGS) data, as well as a 4-member nuclear family with whole-exome sequencing (WES) data. Analysis times scaled linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyze the WES familial data and approximately 1 minute to analyze the 10-member WGS familial data, demonstrating its high efficiency and scalability. In addition, the system is distributed, customizable, and scalable to the available data volume: as more data become available, new data nodes can be seamlessly incorporated into the existing system, making it very nimble. SeqHBase can be applied to manipulate and analyze WGS data from millions of genomes.
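The family-based filters that SeqHBase parallelizes over data nodes reduce, per variant site, to simple genotype logic. Below is a minimal sketch of that logic under simplified assumptions (genotypes as "0/1"-style strings; real VCF parsing, quality filters, and coverage checks are omitted).

```python
# Minimal sketch of the trio genotype logic behind de novo and compound
# heterozygote detection; a toy stand-in for SeqHBase's distributed filters.

def is_de_novo(child_gt, father_gt, mother_gt):
    """De novo candidate: the child carries an alternate allele
    that neither parent carries."""
    return "1" in child_gt and "1" not in father_gt and "1" not in mother_gt

def is_compound_het(gene_variants):
    """Candidate compound heterozygote: within one gene, the child inherits
    at least one heterozygous variant from each parent."""
    from_father = any(v["child"] == "0/1" and v["father"] == "0/1"
                      and v["mother"] == "0/0" for v in gene_variants)
    from_mother = any(v["child"] == "0/1" and v["mother"] == "0/1"
                      and v["father"] == "0/0" for v in gene_variants)
    return from_father and from_mother

print(is_de_novo("0/1", "0/0", "0/0"))  # True: child-only alternate allele
print(is_compound_het([
    {"child": "0/1", "father": "0/1", "mother": "0/0"},
    {"child": "0/1", "father": "0/0", "mother": "0/1"},
]))  # True: one het variant inherited from each parent
```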
- SparkText: an efficient toolset for data mining large-scale scientific literature
[top]
Text mining is a specialized data mining method that extracts information (e.g., facts, biological processes, or diseases) from text such as scientific literature. We utilized natural language processing (NLP), machine learning strategies, and Big Data infrastructure to design and develop a distributed, scalable framework that extracts information (e.g., cancer types such as breast, prostate, and lung cancer) and builds prediction models to classify information extracted from 29,437 full-text articles downloaded from PubMed Central. We employed three classification algorithms, Naive Bayes, Support Vector Machine (SVM), and Logistic Regression, to build prediction models using 5-fold cross-validation on the 29,437 full-text articles. The framework was developed on a Big Data infrastructure comprising an Apache Hadoop cluster together with the Apache Spark component and a Cassandra database. Mining the 29,437 full-text articles took about 6 minutes on the Big Data platform, versus more than 11 hours without any Big Data infrastructure, showing that mining large-scale biomedical literature can be significantly accelerated on such a platform. The accuracy, precision, and recall of predicting a cancer type with any of the three machine learning methods on the 29,437 full-text articles were comparable to or better than those obtained with other tools, such as the Weka library and TagHelper Tools. Both the time efficiency and the accuracy of our scalable framework were promising, and this strategy will provide tangible benefits to medical research.
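The sketch below illustrates the shape of such a pipeline using Spark's standard ML API (tokenize, TF-IDF, classifier, 5-fold cross-validation). The toy corpus, column names, and parameter values are illustrative stand-ins, not SparkText's actual schema or settings.

```python
# Minimal sketch of a Spark text-classification pipeline of the kind
# SparkText builds: tokenize -> TF-IDF -> classifier, with 5-fold CV.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("sparktext-sketch").getOrCreate()

# Toy stand-in for article text; label 0.0 = breast cancer, 1.0 = lung cancer.
docs = spark.createDataFrame([
    ("brca1 mutation identified in breast tumor tissue", 0.0),
    ("mammography screening reduced breast carcinoma mortality", 0.0),
    ("her2 amplification in breast cancer biopsy", 0.0),
    ("tamoxifen response in estrogen receptor positive breast cancer", 0.0),
    ("egfr mutation in lung adenocarcinoma patients", 1.0),
    ("smoking history and non small cell lung cancer risk", 1.0),
    ("alk rearrangement detected in lung tumor samples", 1.0),
    ("chemotherapy outcomes in small cell lung carcinoma", 1.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=50),  # NB or SVM could be swapped in here
])

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
    numFolds=5,  # 5-fold cross-validation, as in the evaluation above
)
model = cv.fit(docs)
print("mean CV accuracy:", model.avgMetrics[0])
```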
- HadoopCNV (collaborative development with Dr. Kai Wang at USC)
[top]
HadoopCNV is a highly scalable solution for accurate detection of copy number variations (CNVs) from WGS data. It infers aberration events, such as copy number changes and loss of heterozygosity (LOH), from information encoded in both allelic and overall read depth. Resolving small regions in samples with deep coverage can be very time-consuming due to massive I/O cost; our implementation is built on the Hadoop MapReduce paradigm, enabling multiple processors to efficiently process separate regions in parallel. We employed a Viterbi scoring algorithm to infer the most likely copy number/heterozygosity state for each region of the genome. We applied HadoopCNV to a 10-member pedigree sequenced on the Illumina HiSeq platform; our method has an overall Mendelian inconsistency rate lower than that of competing approaches, and it performs comparably on the NA12878 individual from the 1000 Genomes Project. Most importantly, our method takes only 1.3 hours from BAM files to CNV output, while other methods take more than 13 hours.
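For readers unfamiliar with Viterbi scoring over copy-number states, here is a minimal self-contained sketch. The states, transition probabilities, and per-bin emission likelihoods are made-up toy values; HadoopCNV derives its emissions from allelic and overall read depth.

```python
# Minimal sketch of the Viterbi scoring step applied to one genomic region:
# find the most likely hidden copy-number state sequence given per-bin
# observation likelihoods. All numbers here are illustrative.
import numpy as np

states = ["CN1 (deletion)", "CN2 (normal)", "CN3 (duplication)"]
log_trans = np.log(np.array([
    [0.98, 0.01, 0.01],   # staying in a state is far more likely
    [0.01, 0.98, 0.01],   # than switching copy number between bins
    [0.01, 0.01, 0.98],
]))
# log P(observed depth in bin t | state) for 6 consecutive bins (toy values)
log_emit = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]))

n_bins, n_states = log_emit.shape
score = np.full((n_bins, n_states), -np.inf)
back = np.zeros((n_bins, n_states), dtype=int)
score[0] = np.log(1.0 / n_states) + log_emit[0]  # uniform initial prior
for t in range(1, n_bins):
    for s in range(n_states):
        cand = score[t - 1] + log_trans[:, s]    # best predecessor of s
        back[t, s] = int(np.argmax(cand))
        score[t, s] = cand[back[t, s]] + log_emit[t, s]

# Backtrace the maximum-likelihood state path.
path = [int(np.argmax(score[-1]))]
for t in range(n_bins - 1, 0, -1):
    path.append(back[t, path[-1]])
print([states[s] for s in reversed(path)])  # CN1, CN1, CN2, CN2, CN3, CN3
```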
- Association Tests for Annotated Variants (ATAV)
[top]
ATAV is a statistical toolset designed to detect rare genetic variants associated with complex diseases by performing association analysis, trio analysis, and/or linkage analysis on whole-genome or whole-exome sequencing data.
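As one illustration of the kind of rare-variant association analysis such a toolset performs (this is a generic example, not ATAV's actual implementation), a gene-level collapsing/burden test compares counts of qualifying-variant carriers between cases and controls:

```python
# Illustrative gene-level burden test: compare carrier counts between cases
# and controls with Fisher's exact test. Counts below are toy values.
from scipy.stats import fisher_exact

case_carriers, case_total = 12, 500
control_carriers, control_total = 3, 500

table = [
    [case_carriers, case_total - case_carriers],
    [control_carriers, control_total - control_carriers],
]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3g}")
```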
- A User-Friendly Software Tool for Population Stratification Adjustment in Genome-Wide Association Studies
[top]
Population stratification is characterized by systematic differences in allele frequencies between sub-populations. If differences in disease burden between sub-populations are also present, population stratification can produce false-positive associations between the disease and genetic variants. The “stratification score” approach of Epstein, Allen, and Satten addresses this problem: the basic idea is to form strata within which individuals have similar baseline probabilities of disease conditional on genomic information. Stratified association tests using these strata have been shown to have both the correct type I error rate and good power. Here we present a user-friendly software tool that implements the “stratification score” and can handle genome-wide association data. The tool imports data in many popular formats and performs several other useful functions, including the calculation and visualization of principal components. Both Web-based and standalone versions are implemented; the Web-based version allows research groups to operate under a client/server model in which users interact with the tool remotely, receiving results via email if they wish.
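The sketch below walks through the stratification-score idea end to end on toy data: estimate each subject's baseline disease probability from genomic covariates, bin subjects into strata of similar risk, then test a candidate SNP within strata. The PC count, five quintile strata, and the stratum-adjusted score statistic are illustrative choices, not necessarily those of the tool.

```python
# Minimal sketch of the "stratification score" workflow on toy data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m = 600, 200
genotypes = rng.integers(0, 3, size=(n, m)).astype(float)  # subjects x SNPs
disease = rng.integers(0, 2, size=n)                       # 0/1 phenotype

# 1) Principal components summarize genome-wide ancestry.
pcs = PCA(n_components=10).fit_transform(genotypes)

# 2) Stratification score = fitted probability of disease given the PCs.
score = LogisticRegression(max_iter=1000).fit(pcs, disease).predict_proba(pcs)[:, 1]

# 3) Form five strata of similar baseline risk (quintiles of the score).
strata = np.digitize(score, np.quantile(score, [0.2, 0.4, 0.6, 0.8]))

# 4) Test one candidate SNP within strata via a stratum-adjusted score test
#    on allele counts (Cochran-Mantel-Haenszel in spirit).
snp = genotypes[:, 0]
num, den = 0.0, 0.0
for s in np.unique(strata):
    idx = strata == s
    y, g = disease[idx], snp[idx]
    num += np.sum((y - y.mean()) * g)                    # score contribution
    den += y.var() * np.sum((g - g.mean()) ** 2)         # its variance
z = num / np.sqrt(den)
print(f"stratified score statistic z = {z:.2f}")  # ~N(0,1) under the null
```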
- Statistical Analysis of Antigen Receptor Spectratype Data
[top]
Spectratype analysis (SpA) is a method used in clinical and basic immunological settings in which antigen receptor length diversity is assessed as a surrogate for functional diversity. We have developed statistical methods for comparing multiple spectratypes in a variety of ways. The fundamental statistic for these comparisons and tests is the completeness, an information-theoretic quantity that arises naturally in the statistical derivations. The completeness is closely related to the entropy as a measure of the diversity of the antigen receptor repertoire and serves as a sensitive and objective measure of the state of the repertoire. Several statistical tests based on the completeness are performed automatically upon data submission, and additional tests are available to the user online through SpA. Specialized statistical tools, developed for hypothesis testing and modeling across multiple spectratypes, are also available through the SpA interface. In addition to the specific procedures provided by SpA, the powerful general-purpose data analysis package R is integrated into the SpA system for more specialized procedures (Bioinformatics, 2005, 21, 3394-3400; Bioinformatics, 2005, 21, 3697-3699). The tool is used both on campus and throughout the world.
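To convey the entropy-style diversity idea, here is a minimal sketch computing the normalized Shannon entropy of a CDR3 length distribution. The completeness statistic in the SpA papers is a related but distinct information-theoretic quantity; the formula below is shown only to illustrate how repertoire skewing lowers such a measure.

```python
# Illustrative entropy-style diversity of a spectratype (the distribution of
# antigen-receptor CDR3 lengths); not the exact completeness statistic.
import math

def normalized_entropy(counts):
    """Shannon entropy of the length distribution, scaled to [0, 1]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts))  # 1.0 = perfectly even repertoire

healthy = [5, 12, 30, 55, 70, 55, 30, 12, 5]   # roughly Gaussian profile
skewed  = [0,  2,  3, 90,  4,  3,  2,  0, 1]   # clonally expanded repertoire
print(f"healthy: {normalized_entropy(healthy):.2f}")  # near 1: diverse
print(f"skewed:  {normalized_entropy(skewed):.2f}")   # much lower: restricted
```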
- OmicShare
[top]
OmicShare is a collaborative work environment that enables users to easily store, manage, and share all types of instrumental and analytical data files for project management in biomedical research. It facilitates research collaboration and reduces the risk of data loss. OmicShare has a user-friendly interface accessed through an Internet browser. Data files are uploaded to the system, which is backed by a robust database (any relational database, such as Oracle, MySQL, or PostgreSQL), by selecting, copying, or simply dragging and dropping files. OmicShare allows users to upload or download multiple subfolders and files with a single click. The data supplier or system administrator can grant collaborators different permissions on folders or files. OmicShare allows users to share files with collaborators quickly, easily, and professionally: users can securely navigate to the projects in which they are involved, communicate with other collaborators inside and outside their organizations, upload or download single or multiple data files with one click, and download analysis results. Click here to evaluate the software.