FASTERp: A Feature Array Search Tool for Estimating Resemblance of Protein Sequences
Derek Macklin, Stanford University
Metagenome sequencing efforts have provided a pool of billions of genes from which to identify enzymes with desirable biochemical traits. However, homology search against billions of genes in a rapidly growing database has become computationally impractical. Here we present our pilot efforts to develop a novel alignment-free algorithm for homology search. Specifically, we represent each protein as a feature vector that denotes the presence or absence of short k-mers in the protein sequence. Similarity between feature vectors is then computed using the Tanimoto score, a similarity coefficient that can be rapidly computed on bit-string representations of feature vectors. Preliminary results indicate good correlation with optimal alignment algorithms (Spearman r of 0.87, about 1 million proteins from Pfam), as well as with heuristic algorithms such as BLAST (Spearman r of 0.86, about 1 million proteins). Furthermore, a prototype of FASTERp implemented in Python runs approximately four times faster than BLAST on a small-scale data set (about 1,000 proteins). We are now optimizing and scaling FASTERp to support rapid homology searches against billion-protein databases, thereby enabling more comprehensive gene annotation efforts.
Abstract Author(s): Derek Macklin, Rob Egan, Zhong Wang
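The representation described in the abstract can be illustrated with a minimal Python sketch. This is not the FASTERp implementation: the k-mer length, the restriction to the 20 standard amino acids, the use of a Python integer as the bit string, and the names kmer_index, feature_bits, and tanimoto are all assumptions made for illustration.

```python
from itertools import product

# Assumption: features are k-mers over the 20 standard amino acids only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_index(k):
    """Assign each possible k-mer a fixed bit position (20**k positions)."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}


def feature_bits(sequence, index, k):
    """Encode k-mer presence/absence as the bits of a Python integer."""
    bits = 0
    for start in range(len(sequence) - k + 1):
        pos = index.get(sequence[start:start + k])
        if pos is not None:  # skip k-mers containing non-standard residues
            bits |= 1 << pos
    return bits


def tanimoto(a, b):
    """Tanimoto score: shared set bits divided by bits set in either vector."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty feature vectors
    return bin(a & b).count("1") / union


# Hypothetical toy sequences, k = 2 (an illustrative value, not from the abstract).
k = 2
index = kmer_index(k)
x = feature_bits("ACDE", index, k)  # k-mers: AC, CD, DE
y = feature_bits("ACDF", index, k)  # k-mers: AC, CD, DF
print(tanimoto(x, y))  # 0.5 (2 shared k-mers out of 4 distinct)
```

Because each comparison reduces to one bitwise AND, one bitwise OR, and two population counts, no alignment is computed, which is the property the abstract relies on for speed.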