Divide-and-BLAST addresses the problem of filtering high-similarity hits out of the list of hits for a sequence, leaving possibly significant weak-similarity hits for further investigation. The program divides its input sequence into a number of sub-sequences, whose length and overlap can be specified as parameters. It then submits both the full sequence and each sub-sequence to the BLAST server using the BLAST network client. After receiving the results, Divide-and-BLAST removes the hits for the full sequence from the list of hits for each sub-sequence (Fig. 1). The output is a file listing the unique hits for each sub-sequence. If a protein has unique hits on more than one sub-sequence and does not appear in the list of hits for the full sequence, it is very likely that there is a significant similarity between that protein and the input sequence. Even if no protein recurs across sub-sequences in this way, some of the relatively high-similarity unique hits might warrant further investigation using other methods, computational or experimental.
Figure 1. A diagrammatic representation of the Divide-and-BLAST process.
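The divide and filter steps described above can be sketched in a few lines of Python. This is only an illustration, not the code in dab.pl: the function names are hypothetical, hits are represented as plain identifier strings, and the handling of the final (possibly shorter) sub-sequence is an assumption, since the description does not say how the program treats it.

```python
def split_into_subsequences(sequence, length=20, overlap=10):
    """Return (start, sub_sequence) pairs for overlapping windows over the sequence.

    Assumption: the last window is kept even if it is shorter than `length`.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap  # distance between consecutive window starts
    pieces = []
    start = 0
    while True:
        pieces.append((start, sequence[start:start + length]))
        if start + length >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return pieces


def unique_hits(full_hits, sub_hits):
    """Remove from each sub-sequence's hit list every hit found for the full sequence.

    `sub_hits` maps a sub-sequence start position to its list of hit identifiers.
    """
    full = set(full_hits)
    return {start: [hit for hit in hits if hit not in full]
            for start, hits in sub_hits.items()}
```

With a made-up ten-letter sequence, `split_into_subsequences("ABCDEFGHIJ", length=4, overlap=2)` yields windows starting at positions 0, 2, 4 and 6. In the real program, the hit lists would come from the BLAST server rather than being supplied by hand.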
The source code for Divide-and-BLAST is currently available in the following two formats:
Windows: ZIP
Unix: gzipped tar file
Once you have downloaded the source code, unzip or decompress the file into a folder of your choice and follow the instructions below for running the program.
Unix:
Navigate to the directory into which you untarred the file. Type:
./dab.pl -h
at the console for a list of options.
Windows:
Open a DOS/command prompt and navigate to the folder into which you unzipped the file. Type:
dab.pl -h
at the DOS prompt for a list of options and more usage information.
Some basic DOS commands:
md dirname | Make a directory called "dirname" |
cd dirname | Go to the directory named "dirname" |
cd \ | Change directory to the root (C:\) |
cd .. | Move to the parent directory |
dab.pl <filename> (options)
where <filename> is the name of the file containing the input
sequence in FASTA format and options can be one or more of the following:
-h | Prints help information |
-H | Generates HTML output file |
-l <length> | Length of sub-sequences (default 20 amino acids) |
-o <overlap> | Overlap between sub-sequences (default 10 amino acids) |
-e1 <expect value> | Expect value for full sequence BLAST (default 10.0) |
-e2 <expect value> | Expect value for sub-sequence BLAST (default 10.0) |
-O <output dir> | Specify an output directory (default "output") |
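When the program is driven from a script, the options above can be assembled into an argument list before the call. The helper below is a hypothetical convenience, not part of Divide-and-BLAST; it uses only the flags listed in the table and assumes, following the usage line, that the file name comes first. It builds the command but does not run it.

```python
def build_dab_command(fasta_path, length=None, overlap=None, e_full=None,
                      e_sub=None, html=False, output_dir=None, script="dab.pl"):
    """Build an argument list for dab.pl from the documented options.

    Options left as None are omitted, so the program's own defaults apply.
    """
    cmd = [script, fasta_path]
    if html:
        cmd.append("-H")                  # generate an HTML output file
    if length is not None:
        cmd += ["-l", str(length)]        # sub-sequence length
    if overlap is not None:
        cmd += ["-o", str(overlap)]       # overlap between sub-sequences
    if e_full is not None:
        cmd += ["-e1", str(e_full)]       # Expect value for the full sequence
    if e_sub is not None:
        cmd += ["-e2", str(e_sub)]        # Expect value for the sub-sequences
    if output_dir is not None:
        cmd += ["-O", output_dir]         # output directory
    return cmd
```

The resulting list could be passed to `subprocess.run`; on Unix the `script` argument would be `"./dab.pl"`, matching the invocation shown earlier. The file name `seq.fasta` used in any call is a placeholder.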
A sample output for Divide-and-BLAST can be seen here.
These were the results obtained when Divide-and-BLAST was used to analyze
the human isocitrate dehydrogenase protein sequence, using sub-sequences
of length 20 amino acids and overlap of 10 amino acids. Notice the hits
for isopropylmalate dehydrogenase; Divide-and-BLAST clearly found an evolutionary
relationship, and localized it to a certain area of the sequence.
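Spotting a protein that recurs across sub-sequences, and seeing where along the input it falls, amounts to grouping the unique hits by protein. The sketch below assumes the unique hits have already been parsed into a dictionary keyed by sub-sequence start position; the function name and the identifiers in the example are hypothetical, not taken from the sample output.

```python
from collections import defaultdict


def recurring_hits(unique_by_sub):
    """Map each hit to the sorted start positions of the sub-sequences it matched.

    Only hits that appear on more than one sub-sequence are returned.
    """
    positions = defaultdict(set)
    for start, hits in unique_by_sub.items():
        for hit in hits:
            positions[hit].add(start)
    return {hit: sorted(starts)
            for hit, starts in positions.items() if len(starts) > 1}


# Made-up data: "protein_A" is a unique hit for the sub-sequences at 100 and 110.
example = {90: ["protein_B"], 100: ["protein_A"], 110: ["protein_A", "protein_C"]}
print(recurring_hits(example))
```

The start positions returned for a recurring hit indicate roughly which region of the input sequence carries the similarity.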
What are Expect values?
In general, higher expect values mean lower similarities and vice versa.
The Expect value parameter is the cutoff value -- any hits with Expect
values above the one specified will not be shown. Since the Expect value depends
on sequence length, increasing the Expect value cutoff for the sub-sequence BLASTs
can sometimes turn up more unique hits than the default value of 10.0. For
a detailed explanation of Expect values, see the BLAST
FAQ at NCBI.
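The dependence on length can be made concrete with the standard relation between a bit score S' and the Expect value, E = m * n * 2^(-S'), where m is the query length and n is the total length of the database. The sketch below is a simplification: real BLAST applies effective-length corrections, and the database size used here is a made-up figure, not that of any actual database.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate Expect value from a bit score: E = m * n * 2**(-S').

    Simplified: ignores the effective-length corrections that BLAST applies.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical database of 2**30 residues and a 30-bit alignment:
print(expect_value(30, 20, 2**30))   # 20-residue sub-sequence as the query
print(expect_value(30, 400, 2**30))  # 400-residue full sequence as the query
```

Because E falls off exponentially with the score, and a 20-residue sub-sequence can only accumulate a limited score, even its best alignments may carry Expect values near or above the default cutoff. That is one reason a larger -e2 value can reveal additional unique hits.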