How to Approach Genetic Sequencing Databases: A Primer

Lily Wang, Ph.D.


Vanderbilt Kennedy Center

Statistics and Methodology Core

Training Series



Over the past decade, with advances in rapid sequencing technology and the completed sequencing of the chromosomes of model organisms such as E. coli, mouse, fruit fly and human, an enormous amount of DNA and protein sequence data has accumulated in genetic databases. Given this massive amount of sequence data, important goals of modern molecular biology are to understand the structure, function and evolutionary relationships of genes and proteins. However, experimental methods for achieving these goals are time-consuming and expensive, and therefore cannot keep pace with the rapid growth of sequence databanks. Regions of protein sequences that change less than the rest during evolution often indicate similarities in structure, function and phylogenetic relationships. Sequence comparison methods, which rest on the premise that similar sequences imply similar structures, have been effective tools for understanding the properties of biological molecules.


This talk aims to provide researchers with an overview of the major genetic (nucleotide and protein) sequence databanks and the major software tools for sequence alignment and database searching. In addition, I will introduce the statistical and computational issues involved in sequence comparison methods.



Topics:

1. An overview of major sequence databases.

2. Why sequence comparison methods are effective tools for understanding properties of proteins.

3. Methods for alignment of sequences and statistical distributions of alignment scores.

4. Commonly used software tools and general guidelines for conducting sequence analysis.

5. Further research in this area.



Objectives:

1. To give an overview of major sequence databases and software tools for database searching.

2. To introduce the biological background of sequence analysis and to identify the statistical and computational issues involved.

3. To introduce algorithms for sequence alignment and to show how statistical distributions are used to assess the significance of alignment scores.


Intended Audience:

Researchers who are currently using genetic sequencing databases.


Researchers who might be interested in exploring the relationship between their current research areas and genetics, in particular biological sequence analysis.


Speaker Description:

Lily Wang, Ph.D., Assistant Professor in Biostatistics, is a statistical advisor of the Quantitative Core. Wang has a doctorate in Biostatistics from the University of North Carolina at Chapel Hill. Before joining Vanderbilt in 2004, she worked for six years in the Biometric Consulting Lab at UNC under the direction of Dr. Gary Koch, where she collaborated with researchers in medicine, public health and social science. Her thesis focused on developing statistical methods, based on asymptotic theory as well as Bayesian methods, for approximating the distribution of protein sequence alignment scores for database searching in bioinformatics. In addition, Wang maintains a broad interest in biostatistics and, in particular, in the analysis of longitudinal and correlated data using generalized estimating equations and mixed models.