Molecular Biology

Protein and Nucleic Acid Sequence Analysis

Advances in biotechnology have produced massive volumes of biological sequences and their associated biological and bibliographical information. For example, the current release (87.0) of GenBank from NLM's National Center for Biotechnology Information (NCBI) contains 269,478 sequences from over 11,000 different species and consisting of more than 248 million nucleic acid bases. Researchers who discover new sequences search the database for other sequences that are similar or otherwise relevant to their studies. The biological information associated with similar sequences may provide clues to the structure and function of the newly discovered sequences. Integrated structural modeling using all known sequences and structures in conjunction with powerful computational and visualization tools available this year at NCBI will further enhance this discovery process. NCBI's information systems containing data related to molecular biology, biochemistry, and genetics now handle 20,000 searches daily over the Internet.

http://www.ncbi.nlm.nih.gov/

To find similar sequences, the query sequence is compared to every other sequence in the database. The computationally intensive dynamic programming algorithm for comparing two sequences developed by Smith and Waterman has been commonly used. The execution time for this algorithm grows exponentially with the size of the database, making it impractical on conventional sequential computing systems. DCRT has developed a new parallel algorithm that greatly reduces the search time. Taking advantage of the fact that the comparison of the query sequence with each database sequence can be done independently and in parallel, this "static" algorithm sorts the sequences in the database by decreasing length, assigns each sequence in this sorted list to that "bucket" whose contents have the smallest total sequence length, and parcels out the buckets to the processors. Given a query sequence with 50 bases, execution time was 2.3 minutes on a 128-processor Intel iPSC/860 versus 3.3 hours with just one processor. This algorithm was more than twice as fast as a previously developed dynamic load balancing algorithm.

Protein Folding Prediction

The protein folding prediction problem is to determine the three- dimensional structure of a protein molecule given only its amino acid sequence. Understanding this three-dimensional structure is needed in studying the function of proteins in living systems and designing new ones for biological and medical purposes. Amino acid sequences of proteins are being discovered at an explosive rate, but experimental procedures for determining their three-dimensional structure (for example, x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy) are slow, costly, and complex. Theoretical and computational methods can help by predicting structure from sequence.

One reason why this problem is unsolved is that the biochemical rules governing the folding and the stability of proteins are not known. If they were known, an algorithm simulating the folding could be implemented. One alternative available today is an algorithm that finds the "best" conformation by searching all possible conformations; however because this space of all possible conformations is much too large to search, the search is restricted to a lattice. The restrictions in this lattice-space Monte Carlo method introduce distortions and consequently imperfect results.

DCRT is developing parallel algorithms so that more conformations can be searched and a more realistic energy function can be computed. These algorithms implement computationally intensive strategies for searching a large number of structures and calculating each one's free energy, particularly the hydrophobic potential that is proportional to the solvent-accessible surface area (ASA) of the protein molecule. The dihedral angle space Monte Carlo (DASMOC) algorithm combines knowledge-based rules and global minimization of free energy to search a large and complex energy space in order to find a protein structure in the most stable energy state. Current joint DCRT and NIH/NCI activities include implementing DASMOC on DCRT's Intel and developing parallel search algorithms that can work with DASMOC; DCRT is also developing parallel methods for ASA calculations.

Protein folding can generally be considered to occur in three phases or structures: a primary structure in which amino acids are linked together end-to-end according to its sequence; a secondary structure in which helices and beta sheets are formed; and a tertiary structure that is the final functional form of the protein. The secondary structure of proteins up to a certain size can be determined using NMR. NIH/NCRR has supported the development of a new parallel algorithm that uses the secondary structure obtained from NMR experiments and amino acid sequence information to calculate a lowest energy structure, which is the predicted tertiary (stable final form) structure. This algorithm has been tested, and the predicted results are accurate for a number of proteins whose structure is known.

Ribonucleic Acid (RNA) Structure Prediction

The ribonucleic acid (RNA) molecule serves several roles in nature. It is chemically very similar to DNA, but the differences are enough to distinguish its activities from the DNA molecule. The molecule is a key component for protein synthesis, serving as a template for protein generation (messenger RNA or mRNA) and being part of associated molecules for the synthesis of the proteins (ribosomal RNA or rRNA, and transfer RNA or tRNA). Many viral infections are also caused by RNA molecules, including the common cold, human immunodeficiency virus (HIV), and polio. The mechanisms for infection may vary. Some may be used like mRNA while others, like HIV, actually go through a reverse transcription process into the DNA. The viral packaging usually carries the virus' viability.

NCI has developed a new experimental "genetic algorithm" for RNA folding. This algorithm is highly parallelizable and rapidly convergent to solutions in a large conformational search space of RNA structures. It borrows from the processes of biological evolution using operations such as mutation, recombination and reproduction, and a selection criterion based on the idea of the survival of the fittest. The algorithm has been designed to run on the MasPar MP-2 massively parallel computer. It computes 16,384 RNA conformations at each generation. Using a random but structured information exchange at each generation allows it to iterate towards an optimal solution.

NCI also developed a parallel dynamic programming algorithm that generates suboptimal solutions for RNA folding. The program on the MasPar MP-2 is capable of rapidly folding RNA sequences that are over 9,000 nucleotides in length. HIV and the common cold virus are some examples of large sequences that have been folded with this algorithm.

http://www-lmmb.ncifcrf.gov