Methods and Materials

My investigation of isocitrate dehydrogenase (IDH) and caspase 3 used several publicly available software packages (BLAST, ScanPROSITE, PHI-BLAST, PSI-BLAST, HMMER) and one program that I wrote myself, Divide-and-BLAST. Most of these tools require the amino acid sequences of the proteins being investigated. I obtained these sequences from the protein database at the National Center for Biotechnology Information (NCBI), using the Entrez search tool. To help future researchers apply these tools to other proteins, I have compiled a set of directions for using them to examine the evolution of proteins.

The input sequences used for IDH1 (cytosolic IDH, NADP+ dependent) were from the following organisms: Homo sapiens (human - gi|6647551), Mus musculus (mouse - gi|6647554), Saccharomyces cerevisiae (yeast - gi|1708403), Arabidopsis thaliana (gi|4585978) and Escherichia coli (gi|124171). For the caspase 3 study, I used the protein sequences from Xenopus laevis (African clawed frog - gi|2493528), Gallus gallus (chicken - gi|3450875), Rattus norvegicus (Norway rat - gi|1004371), Homo sapiens (gi|4757912) and Mus musculus (gi|4757912).


BLAST (Basic Local Alignment Search Tool) is a program that looks for similarities between a given query sequence and a database of known sequences. When given a protein sequence, BLAST returns a list of proteins that are similar to it. BLAST can be used with both nucleotide sequences and protein sequences (Altschul et al., 1990).

I used BLAST with IDH and caspase 3 to get an initial idea of any homologous proteins in other organisms, as well as other related protein families.


ScanPROSITE is a program that searches a given protein sequence for the presence of known conserved amino acid patterns, stored in the PROSITE database (SIB, 2000). It provides a quick way of finding known consensus sequences in a given protein sequence.

I ran the ScanPROSITE program with IDH and caspase 3 as input sequences to find any conserved patterns. The output of PROSITE includes a "signature" for the patterns found, and I used these patterns to initiate PHI-BLAST searches, as explained below.
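To make the idea of a PROSITE-style pattern concrete, the following is a minimal sketch of how such a pattern can be translated into a regular expression and matched against a sequence. It is not the ScanPROSITE implementation. The function names, the example pattern and the example sequence are all made up for illustration; the pattern is not an actual PROSITE entry for IDH or caspase 3.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax elements: '-' separators, 'x' for any
    residue, [ABC] for allowed residues, {ABC} for forbidden residues,
    (n) or (n,m) repeat counts, and the '<' / '>' terminus anchors.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # [ABC] classes and single letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Hypothetical pattern and sequence, for illustration only.
hits = scan("MKCAASAGLL", "C-x(2,4)-[ST]-{P}-G")
print(hits)
```

Note that this sketch reports only non-overlapping matches, which may differ from how a real pattern scanner counts hits.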


PHI-BLAST (Pattern Hit Initiated BLAST) takes a protein sequence and an amino acid pattern as input. It searches for the presence of the pattern in the protein sequence, and then forms a search matrix based on the protein sequence around where the pattern match occurs. The search matrix can be thought of as a multidimensional consensus sequence. The search matrix is then used to search the NCBI (National Center for Biotechnology Information) database of known protein sequences for similar pattern-protein matches (Zheng et al., 1997).

The pattern that PHI-BLAST uses as input needs to be in PROSITE format. Therefore, I used ScanPROSITE to find the correct patterns to initiate the PHI-BLAST search for the proteins I was investigating.


PSI-BLAST (Position Specific Iterated BLAST) is identical to BLAST the first time it is run; it generates a list of matches for the input protein sequence from the NCBI database. The user can then choose certain matches to be used to form a search matrix, which is used in the next iteration of PSI-BLAST (Altschul et al., 1997).

I used PSI-BLAST to look for regions of my protein sequence that it shared with a particular kind or family of protein. For example, in the case of IDH, I picked out matches with other dehydrogenase proteins and ran the second iteration of PSI-BLAST. This let me direct my similarity searches in a given direction, but it required care. Over-representation of spurious matches (those arising by chance rather than true homology) in an iteration will yield more matches that are not truly related to the original query sequence. A good rule of thumb is to choose several matches known to be closely homologous, which reduces the effect of one or more false hits on the next iteration.
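The notion of a search matrix built from selected matches can be illustrated with a toy position-specific scoring matrix. This is only a sketch under simplifying assumptions: the input segments are already aligned, ungapped and of equal length, every amino acid has the same background frequency, and a flat pseudocount is used. PSI-BLAST itself is considerably more sophisticated, and the function names and example segments below are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build log-odds scores per column from equal-length aligned segments.

    Assumes a uniform background frequency (1/20) for every amino acid,
    which is a simplification made for this illustration.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as long as the matrix."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


# Hypothetical aligned segments chosen as "true" matches.
chosen = ["GDGK", "GDGR", "GEGK"]
matrix = build_pssm(chosen)
print(score(matrix, "GDGK"), score(matrix, "AAAA"))
```

A segment resembling the chosen matches scores higher than an unrelated one. The sketch also shows why a spurious match is harmful: each chosen sequence shifts the column frequencies, so a false hit pulls the next iteration toward its own residues.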


To reduce the probability of false (non-homologous) matches when using sequence similarity programs, it is sometimes useful to create a consensus sequence for a group of homologous proteins. Since any protein truly related to the protein being investigated would be related to the homologues in different organisms, using a consensus sequence reduces the probability of a unique derived sequence in one homologue falsely matching an unrelated protein, purely by chance.

CLUSTALW is a public domain multiple sequence alignment program (Thompson et al., 1994). It generates an alignment file that can be fed into HMMER, a suite of programs that uses Hidden Markov Models (HMMs) (Durbin et al., 1998). One of the HMMER programs, hmmemit, can be used to generate a consensus sequence for a given alignment of input sequences.

First I used CLUSTALW to generate an alignment file for my input sequences (5 for IDH1, 5 for caspase 3). Then I ran hmmemit using the alignment file as input to generate a consensus sequence. The consensus sequence was then analyzed with Divide-and-BLAST (see below).
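As a rough illustration of what a consensus sequence is, the sketch below takes a set of aligned sequences and picks the most common residue in each column. This is a plain majority vote, not the HMM-based method that hmmemit uses, and the function name and example alignment are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Return the most common residue per column of an alignment.

    Columns where the gap character is the most common symbol are
    skipped. This is a simple majority vote, used here as a stand-in
    for a profile-based consensus.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:
            consensus.append(residue)
    return "".join(consensus)


# Hypothetical three-sequence alignment, for illustration only.
alignment = ["MKT-LV", "MRT-LV", "MKTALI"]
print(majority_consensus(alignment))
```

A residue that occurs in only one homologue, such as the R in the second sequence above, does not appear in the consensus, which is the property that makes a consensus less prone to chance matches.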


Divide-and-BLAST (DAB) is a Perl program that uses BLAST to find remote similarities between proteins. The details of the algorithm are described on the DAB home page, where the program is also available for download.

In this case DAB was run with its default parameters: a sub-sequence length of 20 amino acids, an overlap of 10 amino acids, and both expect values set to 10. I chose the length and overlap parameters after testing a range of values, and found the default values of 20 and 10 amino acids respectively to be optimal.
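To show what the length and overlap parameters mean, here is a minimal sketch of cutting a sequence into overlapping sub-sequences. It is written in Python rather than Perl and is not taken from DAB; the function name is hypothetical, and keeping a shorter final fragment is an assumption, since the actual handling is described on the DAB home page.

```python
def split_with_overlap(sequence, length=20, overlap=10):
    """Cut a sequence into windows of `length` that share `overlap` residues.

    Returns (1-based start position, sub-sequence) pairs. The last window
    may be shorter than `length`; this is an assumption of the sketch.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be at least 0 and less than length")
    step = length - overlap
    pieces = []
    for start in range(0, len(sequence), step):
        pieces.append((start + 1, sequence[start:start + length]))
        if start + length >= len(sequence):
            break  # this window already reaches the end of the sequence
    return pieces


# A made-up 45-residue sequence, for illustration only.
example = "MSKKISGGSVVEMQGDEMTRIIWELIKEKLIFPYVELDLHSYDLG"
for position, piece in split_with_overlap(example):
    print(position, piece)
```

With a length of 20 and an overlap of 10, each window starts 10 residues after the previous one, so every residue away from the ends of the sequence falls within two windows.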


Chime is an internet browser plug-in that allows the viewing of 3D structures (MDLI, 2000).

I employed Chime in two ways for my investigations. First, I used it to look at the tertiary structures of IDH1 and caspase 3, and compare them with other proteins that showed up as matches in the output from the different similarity search programs. Second, with Chime scripting, I was able to present the results of my study of IDH1 and caspase 3 in an interactive manner on the World Wide Web.

Copyright 2000 Rahul Karnik.