Nucleotide searching

Last Update

  • 5 May 2013


Nucleotides are the molecules that form the basic building blocks of the nucleic acids RNA (ribonucleic acid) and DNA (deoxyribonucleic acid), which are polymers made up of long chains of nucleotides. A nucleotide consists of a sugar molecule (ribose in RNA, deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, the base uracil (U) takes the place of thymine.
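The base alphabets described above can be illustrated with a short Python sketch. The function name dna_to_rna is made up for this example; it only rewrites the letters of a DNA sequence into the RNA alphabet (T becomes U) and does not model any biological process.

```python
DNA_BASES = set("ACGT")  # adenine, cytosine, guanine, thymine


def dna_to_rna(dna: str) -> str:
    """Spell a DNA sequence in the RNA alphabet by replacing T with U."""
    dna = dna.upper()
    unknown = set(dna) - DNA_BASES
    if unknown:
        raise ValueError(f"not a DNA base: {sorted(unknown)}")
    return dna.replace("T", "U")


print(dna_to_rna("ATGCTT"))  # AUGCUU
```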
Homology between nucleotide sequences (similarity that points to shared ancestry) cannot be reliably detected below roughly 75% identity. Below 50%, most hits in a database search are probably noise. Most useful nucleotide searches therefore involve medium- or high-identity matches, for which the NCBI algorithms are usually effective.
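To make the identity thresholds concrete, here is a minimal Python sketch of percent identity between two sequences that have already been aligned. The function name and the choice of denominator (all aligned columns that are not gap-versus-gap) are assumptions for illustration; alignment tools may define identity slightly differently.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns in which both sequences carry the same base.

    Both inputs must be the same length; '-' marks a gap.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # a column with gaps on both sides carries no information
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# 6 of 8 columns match, which sits right at the ~75% threshold
print(percent_identity("ACGTACGT", "ACGTTCGA"))  # 75.0
```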

Nucleotide searching

Nucleotide searching uses Entrez's Nucleotide database, where gene and nucleotide sequences are freely searchable. The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. When searching databases of nucleotide or protein sequences, finding a local alignment of two sequences is one of the main tasks. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery. Gene and nucleotide sequences are used to find:
  • homologous nucleotide sequences across species and in model organisms
  • common ancestors amongst species to determine evolutionary relationships
  • location of a gene or sequence within a genomic region and visualize through gene mapping
  • the encoded amino acid sequence, which can then be used to study protein folding
You can reach the Nucleotide database from PubMed’s main page, either under "Popular" (at the bottom) or by switching databases: the pull-down menu above the search box lets you select Nucleotide. When possible, run BLAST on amino acid sequences (BLASTp) rather than on nucleotide sequences. BLASTn, used for nucleotide sequences, assumes that all base substitutions are equally likely, which is not true. The rate of transition mutations (purine to purine or pyrimidine to pyrimidine) is approximately 1.5 to 5 times that of transversion mutations (purine to pyrimidine or vice versa) in all genomes where it has been measured (see Wakeley, Mol Biol Evol 11(3):436-42, 1994).
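The distinction between transitions and transversions can be sketched in Python as follows. The function names are hypothetical, and the sketch only classifies and counts substitutions between two equal-length DNA sequences; it does not estimate mutation rates.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'match', 'transition' or 'transversion' for a pair of DNA bases."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref == alt:
        return "match"
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def count_substitutions(a: str, b: str) -> dict:
    """Count transitions and transversions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(a, b):
        kind = classify_substitution(x, y)
        if kind != "match":
            counts[kind] += 1
    return counts


print(count_substitutions("ACGT", "GCTT"))
# {'transition': 1, 'transversion': 1}
```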
Code Degeneracy. Some amino acids are coded by more than one codon (e.g. serine is coded by both UCU and AGC). This leads to great variation in how the BLAST algorithm may interpret a nucleotide sequence. Even so, it is still useful to run BLAST on nucleotide sequences. Treat it like an experiment: try blastn, megablast, and blastx or tblastx.
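Degeneracy can be seen directly by building a codon table and listing every codon that maps to one amino acid. The Python sketch below assumes the standard genetic code (NCBI translation table 1), written in the conventional U, C, A, G ordering; the helper name codons_for is made up for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid: str) -> list:
    """Return every codon that codes for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


# Serine (S) is reached from two quite different codon families
print(codons_for("S"))
# ['AGC', 'AGU', 'UCA', 'UCC', 'UCG', 'UCU']
```

Because UCU and AGC differ at every position, two nucleotide sequences can look dissimilar while still encoding the same protein, which is one reason a translated search (blastx or tblastx) is worth trying.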

BLAST search tool

In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is used to compare primary biological sequence data, such as the amino acid sequences of proteins or the nucleotide sequences of DNA. BLAST finds regions of similarity between biological sequences and calculates the statistical significance of the matches.
The BLAST search tool can be used to:
  • identify a sequence
  • find related sequences
  • infer function
  • infer species relatedness
  • perform phylogenetic analysis
BLAST works by breaking queries into a series of “words” or set of letters. These words are compared to words from sequences in the database. Once a match is found, the sequences are aligned and scored. The number of letters per word can be changed by using the "Algorithm parameters" link on the BLAST screen. Insertions and deletions in sequences result in gaps when sequences are aligned. These gaps are assigned a certain score penalty for their existence and for their extension. These scores can also be changed from their defaults by using the “Algorithm parameters” link on the BLAST screen.
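The word-matching step and the two-part gap penalty can be sketched in Python as below. This is a simplified illustration, not NCBI's implementation: it finds only exact word matches (seeds) and omits the extension and scoring of alignments. The function names, and the penalty values of 5 for a gap's existence and 2 for each position of its extension, are assumptions chosen for the example.

```python
from collections import defaultdict


def words(seq: str, size: int):
    """Yield (offset, word) for every overlapping word of `size` letters."""
    for i in range(len(seq) - size + 1):
        yield i, seq[i:i + size]


def seed_hits(query: str, subject: str, size: int) -> list:
    """Return (query_offset, subject_offset) pairs where a word matches exactly."""
    index = defaultdict(list)
    for pos, word in words(subject, size):
        index[word].append(pos)
    return [(q, s) for q, word in words(query, size) for s in index.get(word, [])]


def gap_cost(length: int, existence: int = 5, extension: int = 2) -> int:
    """Penalty for one gap: a fixed cost for existing plus a cost per position."""
    return existence + extension * length


print(seed_hits("ACGTAC", "TTACGTTT", size=4))  # [(0, 2)]
print(gap_cost(3))  # 11
```

A smaller word size produces more seed hits (more sensitive, slower), while a larger one produces fewer, which is the trade-off behind the word-size setting under "Algorithm parameters".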
  • The value of a Nucleotide database query is that it finds regions of sequence similarity, which yield functional and evolutionary clues about your novel sequence.

What is being searched?

Nucleotide searches retrieve results from three databases:
  • Genome Survey Sequence (GSS) database
  • Expressed Sequence Tag (EST) database: ESTs are 300-500 bp pieces of complementary DNA (cDNA) derived from mRNA and used to map where a gene is physically located in the genome
  • Nucleotide Core (Core Nucleotide): contains the sequences not available in the GSS and EST databases; you will see references to Core Nucleotide when searching from PubMed

Sources for the sequences

  • EMBL: European Molecular Biology Laboratory, part of the International Nucleotide Sequence Database Collaboration
  • DDBJ: DNA DataBank of Japan, part of the International Nucleotide Sequence Database Collaboration

How to search Nucleotide


How search results are displayed

Results are displayed from newest to oldest, according to the date each sequence was entered into the database. Searchers can instead choose to sort results by accession number, organism name, taxonomy ID, or the date the entry was modified or released. Click on the “Sort By” drop-down menu to make your selection.

How to read a sequence record

Records can be viewed in a number of formats, including GenBank, FASTA, Graphics, ASN.1, Revision History and GenBank (Full). The following is a description of the GenBank flat file, the default display.
  • Locus: consists of an accession number, length of sequence, molecular type and a code to signify the organism the sequence was derived from; includes latest date record was updated
  • Definition: description of the sequence
  • Accession number: a unique identifier permanently linked to a particular sequence record
  • Version: changes to a sequence are indicated by a version number appended to the accession number after a decimal point
  • Organism name: scientific name of organism, including taxonomic information
  • References: section consists of one (or more) entries that include citation information and commentaries
  • Features: contains biological information in the record, presented in a consistent manner that operates across databases
  • Origin: entire base-pair sequence for the record
See a sample sequence record.
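The fields described above can be pulled out of a flat file with a few lines of Python. The record below is invented for illustration (EXAMPLE01 is not a real accession number), and the parser is a minimal sketch: it reads only single-line top-level fields and the sequence under ORIGIN, and it ignores indented continuation lines and the Features table. For real work a dedicated parser is a better choice.

```python
# Hypothetical record, for illustration only
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 05-MAY-2013
DEFINITION  Hypothetical example sequence, for illustration only.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""


def parse_flat_file(text: str) -> dict:
    """Collect top-level fields and the sequence from a GenBank-style record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number followed by blocks of bases
            sequence.extend(part for part in line.split() if part.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():  # keywords start in the first column
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["ORIGIN"] = "".join(sequence).upper()
    return fields


record = parse_flat_file(SAMPLE_RECORD)
accession, version = record["VERSION"].split(".")
print(accession, version)     # EXAMPLE01 1
print(len(record["ORIGIN"]))  # 24
```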