Blastdbinfo: API access to a database of BLAST databases

 




NCBI offers extensive collections of sequences through its BLAST services (http://blast.ncbi.nlm.nih.gov) for comparing and identifying DNA, RNA and protein sequences. NCBI now deposits descriptions of these sequence collections, known as BLAST databases, in a special database called blastdbinfo that you can access through the Entrez Programming Utilities (E-Utilities). Using blastdbinfo, you can enable a program to find an appropriate database and then send BLAST searches to that database using either the BLAST URL API or standalone BLAST (installed locally).

If you’re unfamiliar with the E-Utilities, please see the E-Utilities documentation for a full description of these tools.

Procedure

1. Use esearch.fcgi to find desired BLAST databases (see Table 1 below for a listing of several useful query fields).

 esearch.fcgi?db=blastdbinfo&term=<database query>

[Parse out database ID from XML output]

2. Use esummary.fcgi to retrieve metadata about the matching databases.

esummary.fcgi?db=blastdbinfo&id=<database ID>

[Parse out database path from XML output]

3. Run a BLAST search with the desired database.

Blast.cgi?CMD=Put&DATABASE=<database path>&PROGRAM=<program>&QUERY=<query>
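The three requests in the procedure can be composed programmatically. The following is a minimal Python sketch, not an official client. The base URLs and the function names are assumptions made for illustration; the procedure itself gives only the script names. The sketch passes the database ID to esummary through the id parameter, which is how E-Utilities summary requests accept UIDs.

```python
from urllib.parse import urlencode

# Assumed endpoints; the procedure above names only the scripts.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def esearch_url(term):
    """Step 1: search blastdbinfo for databases matching an Entrez query."""
    params = {"db": "blastdbinfo", "term": term}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def esummary_url(database_id):
    """Step 2: request the summary record for one database ID."""
    params = {"db": "blastdbinfo", "id": database_id}
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(params)


def blast_put_url(database_path, program, query):
    """Step 3: submit a search against the database path found in step 2."""
    params = {
        "CMD": "Put",
        "DATABASE": database_path,
        "PROGRAM": program,
        "QUERY": query,
    }
    return BLAST_URL + "?" + urlencode(params)
```

urlencode percent-encodes spaces, brackets and slashes, so query terms such as refseq[blast database source] and paths containing "/" can be passed in unescaped.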

Example

For this example, we will look for human BLAST databases containing sequences from the NCBI Reference Sequence (RefSeq) Project.

1. Use esearch with the following query (see Table 1):

refseq[blast database source] AND human[title]


The first few lines of the returned XML result appear below.

<eSearchResult>
<Count>13</Count>
<RetMax>13</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>1023214</Id>
<Id>1001294</Id>
<Id>998664</Id>
…
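To parse out the database IDs, a program can read the <Id> elements under <IdList>. The sketch below uses Python's standard xml.etree.ElementTree on a copy of the excerpt above; the closing tags were added here so the truncated excerpt parses, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Truncated copy of the esearch output shown above, closed so that it parses.
SAMPLE_ESEARCH = """<eSearchResult>
<Count>13</Count>
<RetMax>13</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>1023214</Id>
<Id>1001294</Id>
<Id>998664</Id>
</IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total hit count, list of database IDs) from esearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count", default="0"))
    ids = [elem.text for elem in root.findall("./IdList/Id")]
    return count, ids


count, ids = parse_esearch(SAMPLE_ESEARCH)
print(count, ids)
```

Because the sample is truncated, the list holds only three of the 13 IDs reported in <Count>.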

2. Use esummary to retrieve the names and paths of the databases. In this case, we will use ID 1023214.


The first few lines of the esummary XML appear below.

<eSummaryResult>
<DocumentSummarySet status="OK"><DocumentSummary uid="1023214">
<Name>Human build 37 RNA, reference, and alternate assemblies</Name>
<Path>DBINDEX/9606/allcontig_and_rna</Path>
<Title>human build 37 RNA, alternate and reference assemblies.</Title>
<LastUpdated>2010/11/01 00:00</LastUpdated>
<Description/>
<TotalLength>5886906670</TotalLength>
<MaxLength>115591998</MaxLength>
<NumSequences>50354</NumSequences>
…

The <Path> field contains the BLAST database name together with its path prefix. We can use the complete string in this field to compose a search request using the BLAST URL API or standalone BLAST+.
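Extracting the <Path> value can be sketched the same way. The sample below is an abbreviated copy of the esummary excerpt, with closing tags added so that it parses; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the esummary output shown above, closed so that it parses.
SAMPLE_ESUMMARY = """<eSummaryResult>
<DocumentSummarySet status="OK"><DocumentSummary uid="1023214">
<Name>Human build 37 RNA, reference, and alternate assemblies</Name>
<Path>DBINDEX/9606/allcontig_and_rna</Path>
<NumSequences>50354</NumSequences>
</DocumentSummary></DocumentSummarySet>
</eSummaryResult>"""


def database_paths(xml_text):
    """Map each DocumentSummary uid to the text of its <Path> element."""
    root = ET.fromstring(xml_text)
    return {
        doc.get("uid"): doc.findtext("Path")
        for doc in root.iter("DocumentSummary")
    }


paths = database_paths(SAMPLE_ESUMMARY)
print(paths["1023214"])
```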

3. Use the BLAST URL API to invoke the database, passing the complete <Path> value in the DATABASE parameter.


For standalone BLAST, you can invoke the database on the command line:

blastn -db DBINDEX/9606/allcontig_and_rna -remote -query <query_file> …
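A program that drives standalone BLAST can assemble the same command as an argument list, which avoids shell-quoting problems. This is a sketch only: query.fasta is a hypothetical file name, the function name is illustrative, and any further options elided above are left out.

```python
import shlex


def remote_blastn_command(database_path, query_file):
    """Build the blastn argument list for a remote search of one database."""
    return ["blastn", "-db", database_path, "-remote", "-query", query_file]


# query.fasta is a placeholder for your own query file.
command = remote_blastn_command("DBINDEX/9606/allcontig_and_rna", "query.fasta")
print(shlex.join(command))
```

On a machine with BLAST+ installed, the list can be handed to subprocess.run(command, check=True) to run the search.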

Table 1 – Some useful query fields in blastdbinfo

| Query Field | Sample Values | Example | Function |
|---|---|---|---|
| [blast sequence strategy] (nucleotide databases only) | est, gss, htgs012, htgs0123, wgs | wgs[blast sequence strategy] | Retrieves all databases containing wgs sequences |
| [blast database source] | genbank, gnomon, pdb, refseq, sra, swissprot | refseq[blast database source] | Retrieves all databases containing RefSeq sequences |
| [blast sequence type] | cdna, genomic, otherdna, protein | Protein[blast sequence type] | Retrieves all databases containing protein sequences |
| [title] | Text words within the database title | Non-redundant[title] | Retrieves databases with “non-redundant” in their title |
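Query terms follow the pattern value[field] and can be combined with AND, as in the example query earlier in this document. A small sketch of helpers for composing such terms follows; the helper names are illustrative and not part of any NCBI library.

```python
def field_term(value, field):
    """Format one Entrez term, e.g. refseq[blast database source]."""
    return f"{value}[{field}]"


def all_of(*terms):
    """Join terms so that a database must match every one of them."""
    return " AND ".join(terms)


query = all_of(
    field_term("refseq", "blast database source"),
    field_term("human", "title"),
)
print(query)
```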

For more information

For a complete list of all available field limits for the blastdbinfo database, visit this link: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=blastdbinfo


For technical assistance on BLAST, write to [email protected].