Divide-and-BLAST

Introduction

Divide-and-BLAST is a Perl program written to facilitate the discovery of proteins weakly similar to a protein sequence of interest. The BLAST program at NCBI works very well for high-similarity searches. Unfortunately, weak similarities are often listed below hundreds of high-similarity hits, and may not be shown at all if the number of hits reported is small or if the cutoff Expect value is too low (NCBI, 1999).

Divide-and-BLAST attempts to address the problem of filtering high-similarity hits out of the list of hits for a sequence, leaving possibly significant weak-similarity hits for further investigation. The program divides its input sequence into a number of sub-sequences, whose length and overlap can be specified as parameters. It then submits both the full sequence and each sub-sequence to the BLAST server using the BLAST network client. After receiving the results, Divide-and-BLAST removes the hits for the full sequence from the list of hits for each sub-sequence (Fig. 1). The output is a file listing the unique hits for each sub-sequence. If a protein has unique hits on more than one sub-sequence and does not appear in the list of hits for the full sequence, it is very likely that a significant similarity exists between that protein and the input sequence. Even if no protein is hit by more than one sub-sequence, some of the relatively high-similarity unique hits might warrant further investigation using other methods, computational or experimental.
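The filtering step described above can be illustrated with a minimal Python sketch. It is not the program's Perl source. The function names, the sub-sequence labels and the hit identifiers (P1, P7, ...) are hypothetical, and the hit lists are made up for illustration rather than taken from a real BLAST run.

```python
from collections import defaultdict


def unique_hits(full_hits, sub_hits):
    """Drop every hit reported for the full sequence from each sub-sequence's hit list."""
    full = set(full_hits)
    return {name: [h for h in hits if h not in full]
            for name, hits in sub_hits.items()}


def recurring_hits(unique):
    """Return the hits that remain unique for two or more sub-sequences."""
    seen = defaultdict(list)
    for name, hits in unique.items():
        for hit in set(hits):
            seen[hit].append(name)
    return {hit: sorted(names) for hit, names in seen.items() if len(names) > 1}


# Hypothetical identifiers, for illustration only.
full_hits = ["P1", "P2"]
sub_hits = {
    "1-20": ["P1", "P7"],
    "11-30": ["P2", "P7", "P9"],
    "21-40": ["P9", "P1"],
}

unique = unique_hits(full_hits, sub_hits)
print(unique)                  # unique hits per sub-sequence
print(recurring_hits(unique))  # candidates hit by more than one sub-sequence
```

In this made-up example, P7 and P9 are the candidates worth a closer look: each is hit by two sub-sequences and by neither the full sequence.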

Figure 1. A diagrammatic representation of the Divide-and-BLAST process.

Installing Divide-and-BLAST

Prerequisites:

Divide-and-BLAST is a Perl program and requires a Perl interpreter. Perl for Unix and Windows can be downloaded at www.perl.com.
The BLAST network client is required. This allows searching of the NCBI sequence databases remotely, i.e. without having a local copy of the databases. Download it from the NCBI ftp server.

Source code:

The source code for Divide-and-BLAST is currently available in the following two formats:

Windows ZIP
Unix gzipped tar file

Once you have downloaded the source code, unzip or decompress the file into a folder of your choice and follow the instructions for running the program below.

Basic Usage

Unix:

Navigate to the directory into which you untarred the file. Type

./dab.pl -h

at the console for a list of options and more usage information.

Windows:

Open a DOS/command prompt and navigate to the folder into which you unzipped the file. Type:

dab.pl -h

at the DOS prompt for a list of options and more usage information.

Some basic DOS commands:
md dirname    make a directory called "dirname"
cd dirname    go to the directory named "dirname"
cd \          change directory to root (C:\)
cd ..         move to parent directory

Detailed instructions

The program is used in the following way:

dab.pl <filename> (options)

where <filename> is the name of the file containing the input sequence in FASTA format and options can be one or more of the following:

-h	Prints help information
-H	Generates HTML output file
-l <length>	Length of sub-sequences (default 20 amino acids)
-o <overlap>	Overlap between sub-sequences (default 10 amino acids)
-e1 <expect value>	Expect value for full sequence BLAST (default 10.0)
-e2 <expect value>	Expect value for sub-sequence BLAST (default 10.0)
-O <output dir>	Specify an output directory (default "output")
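To see how the -l and -o values interact, the following Python sketch splits a sequence into overlapping windows. It is an illustration under stated assumptions, not the program's own code: split_sequence is a hypothetical name, consecutive windows are assumed to start (length - overlap) residues apart, and the final window is assumed to be allowed to be shorter than the requested length. How dab.pl actually treats the end of the sequence is not documented here.

```python
def split_sequence(seq, length=20, overlap=10):
    """Split seq into windows of `length` residues that share `overlap` residues.

    Returns (start, end, sub_sequence) tuples with 1-based, inclusive positions.
    Assumption: the last window may be shorter than `length`.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap  # distance between consecutive window starts
    windows = []
    start = 0
    while start < len(seq):
        end = min(start + length, len(seq))
        windows.append((start + 1, end, seq[start:end]))
        if end == len(seq):
            break  # the sequence is fully covered
        start += step
    return windows


# A made-up 45-residue sequence, for illustration only.
example = "ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEF"
for first, last, sub in split_sequence(example, length=20, overlap=10):
    print(first, last, sub)
```

With the default values of 20 and 10, each residue away from the ends of the sequence is covered by two windows, which is what makes it possible for the same protein to be hit by more than one sub-sequence.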

A sample output for Divide-and-BLAST can be seen here. These were the results obtained when Divide-and-BLAST was used to analyze the human isocitrate dehydrogenase protein sequence, using sub-sequences of length 20 amino acids and overlap of 10 amino acids. Notice the hits for isopropylmalate dehydrogenase; Divide-and-BLAST clearly found an evolutionary relationship, and localized it to a certain area of the sequence.

What are Expect values?
In general, higher Expect values mean lower similarity and vice versa. The Expect value parameter is a cutoff: any hit with an Expect value above the one specified will not be shown. Because the Expect value depends on the length of the query sequence, increasing the cutoff for the sub-sequence BLASTs above the default of 10.0 may turn up more unique hits. For a detailed explanation of Expect values, see the BLAST FAQ at NCBI.
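The dependence on length can be made concrete with a rough Python sketch. It uses the simplified textbook relation between a bit score S' and the Expect value, E = m * n * 2^(-S'), where m is the query length and n is the database length. BLAST additionally applies length corrections that are ignored here, and the database size below is a hypothetical figure chosen only for illustration.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Simplified Expect value: E = m * n * 2**(-S'), ignoring length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(cutoff, query_len, db_len):
    """Smallest bit score whose simplified Expect value does not exceed the cutoff."""
    return math.log2(query_len * db_len / cutoff)


DB_LEN = 1e8  # hypothetical database size in residues

# Bit score a 20-residue sub-sequence hit needs at several Expect cutoffs.
for cutoff in (10.0, 100.0, 1000.0):
    print(cutoff, round(min_bit_score(cutoff, 20, DB_LEN), 1))
```

A 20-residue window can only reach a limited score however well it matches, so the score required at a strict cutoff may be out of reach for a weak similarity. Under this simplified model, each tenfold increase in the cutoff lowers the required bit score by about 3.3 bits.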

References

National Center for Biotechnology Information. 1999. BLAST Frequently Asked Questions. <http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html> Accessed 1999 16 Dec.