A system for determining the statistical significance of the frequency of short DNA motif matches in a genome: an analytical approach

M.S. in Computer Science


Department of Computer Science


Advisor: Sudhindra Gadagkar

Advisor: Jenifer Seitzer


A problem in biology arises in evaluating the statistical significance of the observed frequency of candidate transcription factor binding site matches (T_o) in a genome. Because matches can overlap in the genome, the usual chi-square test is unsuitable. In this study, we develop generalized models for evaluating the expectation and variance of the match count T over a variety of probability spaces of randomly occurring sequences of elements (or symbols); these can then be used to perform a Z test. In addition, a software toolset was developed in Java that implements basic tools for manipulating molecular sequences, along with code for the discovery algorithm and the statistical tools for each of the probability models considered. These sequence tools are then included in a proposed design for a workbench to discover sequence motifs in a genome.
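As an illustration of the kind of computation the abstract describes, the following minimal Java sketch counts motif matches while allowing overlaps and then forms a Z statistic from an observed count, an expectation, and a variance. It is not the thesis's toolset: the class name `MotifZSketch` and both method names are hypothetical, and the expectation and variance are passed in as plain numbers because the models that produce them are not given in the abstract.

```java
// Hypothetical sketch; names and structure are assumptions, not the thesis's code.
final class MotifZSketch {
    private MotifZSketch() {}

    /** Counts every occurrence of motif in genome, including overlapping ones. */
    static int countOverlappingMatches(String genome, String motif) {
        if (motif.isEmpty()) {
            throw new IllegalArgumentException("motif must be non-empty");
        }
        int count = 0;
        int from = genome.indexOf(motif);
        while (from >= 0) {
            count++;
            // Advance by one position (not by the motif length) so overlaps are kept.
            from = genome.indexOf(motif, from + 1);
        }
        return count;
    }

    /**
     * Standard Z statistic: (observed - expected) / sqrt(variance).
     * The expected value and variance must come from a probability model
     * chosen by the caller; they are inputs here, not derived.
     */
    static double zScore(double observed, double expected, double variance) {
        if (variance <= 0.0) {
            throw new IllegalArgumentException("variance must be positive");
        }
        return (observed - expected) / Math.sqrt(variance);
    }

    public static void main(String[] args) {
        int observed = countOverlappingMatches("AAAACGAAA", "AA");
        // The expectation and variance below are placeholder values for illustration only.
        double z = zScore(observed, 3.0, 4.0);
        System.out.println("T_o = " + observed + ", Z = " + z);
    }
}
```

Stepping the search forward by one position rather than by the motif length is what lets overlapping matches be counted, which is the situation the abstract identifies as making the chi-square test unsuitable.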


Nucleotide sequence -- Statistical methods -- Computer simulation, Genomes -- Models -- Computer simulation, DNA -- Models -- Computer simulation

Copyright 2011, author