Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 82 A similarity search of a database takes a relatively short query sequence of a protein or DNA fragment and searches every entry in the database for evidence of similarity with the query. In protein databases, the query sequence and the entries in the database are typically of similar sizes. In DNA databases, the entries are typically much longer than the query sequence, and one is looking for subsegments of the entry that match the query. Heuristic Algorithms The problem of searching for protein similarities efficiently has led many investigators to abandon dynamic programming algorithms (for which the size of the problem has become too large) and instead consider designing very fast heuristic procedures: simple, often ad hoc, computational procedures that produce answers that are "nearly" correct with respect to a formally stated optimization criterion. One of the most popular database searching tools of this genre is FASTA (Lipman and Pearson, 1985). FASTA looks for entries that share a significant number of short identical subsequences of symbols with the query sequence. Any entry meeting this criterion is then compared via dynamic programming with the query sequence. In this way, the vast majority of entries are eliminated from consideration quickly. FASTA reports most of the alignments that would be identified by an equivalent dynamic programming calculation, but it misses some matches and also reports some spurious matches. On the other hand, FASTA is very fast. A more recently developed heuristic algorithm is BLASTA (Altschul et al., 1990). BLASTA is faster than FASTA but is capable of detecting biologically meaningful similarities with accuracy comparable to that of FASTA. Given a query A and an entry B, BLASTA searches for segment pairs of high score. A segment pair is a substring from A and a substring from B of equal length, and the score of the pair is that of the no-gap alignment between them. One can argue that the presence of a high-scoring segment pair or pairs is evidence of functional similarity between proteins, because insertion and deletion events tend to significantly change the shape of a protein and hence its function. Note that segment pairs embody a local similarity concept. What is particularly useful is that there is a formula for the probability that two sequences have
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 83 a segment pair above a certain score. Thus BLASTA can give an assessment of the statistical significance of any match that it reports. For a given threshold, T, BLASTA returns to the user all database entries that have a segment pair with the query of score greater than T ranked according to probability. BLASTA may miss some such matches, although in practice it misses very few. The central idea used in BLASTA is the notion of a neighborhood. The t-neighborhood of a sequence S is the set of all sequences that align with S with score better than t. In the case of BLASTA, the t -neighborhood of S is exactly those sequences of equal length that form a segment pair of score higher than t under the Dayhoff scoring scheme (see Figure 3.5). This concept suggests a simple strategy for finding all entries that have segment pairs of length k and score greater than t with the query: generate the set of all sequences that are in the t -neighborhood of some k-substring of the query and see if an entry contains one of these strings. Scanning for an exact match to one of the strings in the neighborhood can be performed very efficiently: on the order of 0.5 million characters per second on a 20 SPECint computer. Unfortunately, for the general problem, the length of the segment pair is not known in advance, and even more devastating is the fact that the number of sequences in a neighborhood grows exponentially in both k and t, rendering it impractical for reasonable values of t. To circumvent this difficulty, BLASTA uses the fast scanning strategy above to find short segment pairs of length k above a score t, and then checks each of these to see if they are a portion of a segment pair of score T or greater. This approach is heuristic (that is, may miss some segment pairs) because it is possible for every length k subsegment pair of a segment pair of score T to have score less than t. Nonetheless, with k = 4 and t = 17 such misses are very rare, and BLASTA takes about 3 seconds for every 1 million characters of data searched. To get an idea of the relative efficiency of various similarity searching approaches, consider the following rough timing estimates for a typical 20 SPECint workstation and a search against a typical protein query. The dynamic programming algorithm for local similarities presented above (also known as the Smith-Waterman algorithm) takes roughly 1000.0N microseconds to search a database with a total of N characters in it. On the other hand, FASTA takes 20.0N microseconds, and BLASTA only about 2.0N microseconds. At the other end of the spectrum, the systolic array