Nucleotide searching

Nucleotide searching at NCBI
Source: http://www.ncbi.nlm.nih.gov/nucleotide/

Last Update

5 May 2013

Introduction

Nucleotides are molecules that comprise the structural elements of RNA (ribonucleic acid) and DNA (deoxyribonucleic acid). As such, nucleotides are the basic building blocks in nucleic acids. RNA and DNA, for example, are polymers made up of long chains of nucleotides. A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) which is attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, the base uracil (U) takes the place of thymine.
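
As a small illustration of the two base alphabets described above, the following Python sketch classifies a sequence string by the letters it contains and rewrites a DNA string in the RNA alphabet. The function names are made up for this example, and ambiguity codes such as N are deliberately not handled.

```python
DNA_BASES = set("ACGT")  # adenine, cytosine, guanine, thymine
RNA_BASES = set("ACGU")  # uracil takes the place of thymine

def molecule_type(seq: str) -> str:
    """Classify a sequence as 'DNA', 'RNA' or 'unknown' by its alphabet.

    A sequence with neither T nor U fits both alphabets; this sketch
    reports it as DNA simply because DNA is checked first.
    """
    letters = set(seq.upper())
    if letters <= DNA_BASES:
        return "DNA"
    if letters <= RNA_BASES:
        return "RNA"
    return "unknown"

def dna_to_rna(seq: str) -> str:
    """Rewrite a DNA string in the RNA alphabet (T becomes U)."""
    return seq.upper().replace("T", "U")

print(molecule_type("GATTACA"))  # DNA
print(dna_to_rna("GATTACA"))     # GAUUACA
```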

Nucleotide sequence homology (shared ancestry, inferred from sequence similarity) cannot be reliably detected below roughly 75% identity. Below 50% identity, most hits returned from a database search are probably noise. Most useful nucleotide searches are therefore medium- or high-identity matches, for which the NCBI algorithms are usually effective.
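
To make the identity thresholds above concrete, here is a minimal Python sketch that computes percent identity for two sequences that have already been aligned (same length, with "-" marking gaps). Conventions for the denominator vary between tools; this sketch assumes gapped columns are skipped, which is not necessarily how BLAST reports identity.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped alignment columns where both sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared

print(percent_identity("ACGT", "ACGA"))  # 75.0
```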

Nucleotide searching

Nucleotide searching uses Entrez's Nucleotide database, where gene and nucleotide sequences are freely searchable. The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. When searching databases of nucleotide or protein sequences, finding a local alignment of two sequences is one of the main tasks. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery. Gene and nucleotide sequences are used to find:

homologous nucleotide sequences across species and in model organisms
common ancestors amongst species to determine evolutionary relationships
location of a gene or sequence within a genomic region and visualize through gene mapping
amino acid sequence, which can then be used to study protein folding

See, for example, the search brca1[gene] AND human[orgn].
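
The same kind of fielded query can also be sent to NCBI programmatically through the E-utilities ESearch service. The Python sketch below only builds the request URL and does not contact NCBI. It assumes the Entrez field tags are written in square brackets and that the Nucleotide database is addressed as "nuccore"; the function name and the retmax default are choices made for this example.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_nucleotide_search_url(gene: str, organism: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a gene-and-organism query (no request is made)."""
    term = f"{gene}[gene] AND {organism}[orgn]"
    # db="nuccore" is assumed here to be the E-utilities name for Nucleotide
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

print(build_nucleotide_search_url("brca1", "human"))
```

Fetching the URL (for example with urllib.request) returns a list of matching record identifiers; NCBI's usage guidelines for automated requests apply.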

You can search Nucleotide from PubMed’s main page, under "Popular" (at the bottom), or by switching databases: above the search box there is a pull-down menu from which to select Nucleotide. When possible, run BLAST on amino acid sequences using BLASTp. BLASTn for nucleotide sequences assumes that all base substitutions are equally likely, which is not true. The rate of transition mutations (purine to purine or pyrimidine to pyrimidine) is approximately 1.5 to 5 times that of transversion mutations (purine to pyrimidine or vice versa) in all genomes where it has been measured (see Wakeley, Mol Biol Evol 11(3):436-42, 1994).
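
The distinction between transitions and transversions can be sketched in a few lines of Python. This is an illustration of the definitions only, not part of any BLAST scoring scheme; the function names are hypothetical and only the four DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref: str, alt: str) -> str:
    """Return 'match', 'transition' or 'transversion' for a pair of DNA bases."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected one of A, C, G, T")
    if ref == alt:
        return "match"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_counts(a: str, b: str) -> dict:
    """Count transitions and transversions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(a, b):
        kind = classify_substitution(x, y)
        if kind != "match":
            counts[kind] += 1
    return counts

print(ts_tv_counts("ACGT", "GCTT"))  # {'transition': 1, 'transversion': 1}
```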

Code degeneracy. Some amino acids are coded by more than one codon (e.g. serine is coded by both UCU and AGC). As a result, nucleotide sequences that look quite different can encode the same protein, so a nucleotide-level comparison can miss relationships that are clear at the protein level. It is still useful to run BLAST on nucleotide sequences. Treat it like an experiment: try blastn, megablast and blastx or tblastx.
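
The effect of code degeneracy can be seen with a deliberately partial codon table. The Python sketch below includes only the codons for serine, leucine, methionine and tryptophan from the standard genetic code, which is enough to show two RNA strings that differ at five of nine positions yet translate to the same three amino acids. The table and function name exist only for this example.

```python
# Partial standard codon table (RNA alphabet); only four amino acids are included.
CODON_TABLE = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu", "CUC": "Leu",
    "CUA": "Leu", "CUG": "Leu",
    "AUG": "Met",
    "UGG": "Trp",
}

def translate(rna: str) -> list:
    """Translate an RNA string codon by codon using the partial table above."""
    rna = rna.upper()
    if len(rna) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]

seq_a = "AUGUCUUUA"
seq_b = "AUGAGCCUG"
differences = sum(x != y for x, y in zip(seq_a, seq_b))

print(differences)       # 5
print(translate(seq_a))  # ['Met', 'Ser', 'Leu']
print(translate(seq_b))  # ['Met', 'Ser', 'Leu']
```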

BLAST search tool

In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is used to compare primary biological sequence data, such as the amino acid sequences of proteins or the nucleotide sequences of DNA. BLAST finds regions of local similarity between biological sequences and calculates the statistical significance of the matches.

The BLAST search tool can be used to:

identify a sequence
find related sequences
infer function
infer species relatedness
perform phylogenetic analysis

BLAST works by breaking queries into a series of “words” or set of letters. These words are compared to words from sequences in the database. Once a match is found, the sequences are aligned and scored. The number of letters per word can be changed by using the "Algorithm parameters" link on the BLAST screen. Insertions and deletions in sequences result in gaps when sequences are aligned. These gaps are assigned a certain score penalty for their existence and for their extension. These scores can also be changed from their defaults by using the “Algorithm parameters” link on the BLAST screen.
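
The word-matching and gap-penalty ideas above can be sketched in Python. This is a toy illustration, not the BLAST implementation: the word size and the match, mismatch and gap scores are placeholder values rather than NCBI defaults, and the alignment passed to the scoring function is assumed to be given already, with "-" marking gaps.

```python
def words(seq: str, word_size: int) -> list:
    """Split a sequence into all overlapping words of the given length."""
    return [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]

def shared_words(query: str, subject: str, word_size: int) -> list:
    """Return (query position, word) pairs for query words found in the subject."""
    subject_words = set(words(subject, word_size))
    return [(i, w) for i, w in enumerate(words(query, word_size))
            if w in subject_words]

def score_alignment(a: str, b: str, match: int = 2, mismatch: int = -3,
                    gap_open: int = -5, gap_extend: int = -2) -> int:
    """Score a given alignment with a penalty for opening and extending gaps.

    Simplification: adjacent gap columns count as one gap, even if the gap
    switches from one sequence to the other.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score += gap_open   # penalty for the gap's existence
            score += gap_extend     # penalty for each gapped position
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

print(shared_words("ACGTAC", "TTCGTATT", 4))  # [(1, 'CGTA')]
print(score_alignment("AC--GT", "ACTTGT"))    # -1
```

Smaller word sizes find more seed matches at the cost of more work, and harsher gap penalties favor alignments with fewer insertions and deletions.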

The value of a Nucleotide database query lies in finding regions of sequence similarity, which yield clues about the structure, function and evolution of your novel sequence.

What is being searched?

Nucleotide searches retrieve results from three databases:

Genome Survey Sequence (GSS) database
Expressed Sequence Tag (EST) database: 300-500 bp pieces of complementary DNA (cDNA) derived from mRNA and used to map where a gene is physically located in the genome
Nucleotide Core (Core Nucleotide): contains the sequences that are not in the GSS and EST databases; you will see references to Core Nucleotide when using PubMed

Sources for the sequences

GenBank: available through the NIH and part of the International Nucleotide Sequence Databases (INSD); consists of annotated DNA sequences made publicly available
EMBL: European Molecular Biology Laboratory, part of the International Nucleotide Sequence Database Collaboration
DDBJ: DNA Data Bank of Japan, part of the International Nucleotide Sequence Database Collaboration
RefSeq: the Reference Sequence database contains annotated, publicly available sequences for DNA, RNA and proteins from various organisms (e.g. viruses, bacteria, eukaryotes)
PDB: the Protein Data Bank

How to search Nucleotide

Retrieving sequencing information. Nucleotide sequence databases tutorial

How search results are displayed

Results are listed from the most recently added sequence to the oldest, according to the date each sequence was entered into the database. Searchers can instead choose to sort results by accession number, organism name, taxonomy ID, or the date the entry was modified or released. Click on the “Sort By” drop-down menu to make your selection.

How to read a sequence record

Records can be viewed in a number of formats, such as GenBank, FASTA, Graphics, ASN.1, Revision History and GenBank (Full). The following is a description of the GenBank flat file, the default display.

Locus: consists of a locus name (often the same as the accession number), length of sequence, molecule type and a division code that indicates the group of organisms the sequence was derived from; includes the latest date the record was updated
Definition: description of the sequence
Accession number: a unique identifier permanently linked to a particular sequence record
Version: changes to a sequence are shown by a version number appended to the accession number after a decimal point
Organism name: scientific name of organism, including taxonomic information
References: section consists of one (or more) entries that include citation information and commentaries
Features: contains biological information in the record, presented in a consistent manner that operates across databases
Origin: entire base-pair sequence for the record
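
The layout of these fields can be illustrated with a small Python sketch that splits a flat file into its top-level sections. The record below is made up for the example (it is not a real GenBank entry), and the parser is a simplification: it assumes keywords start in the first column, continuation lines are indented, and it flattens nested sections such as Features rather than interpreting them. For real work, an established library such as Biopython is a better choice.

```python
# A made-up toy record, laid out like a GenBank flat file (not real data).
TOY_RECORD = """\
LOCUS       TOY000001                 12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example record.
ACCESSION   TOY000001
VERSION     TOY000001.2
ORIGIN
        1 acgtacgtac gt
//
"""

def parse_flat_file(text: str) -> dict:
    """Map each top-level keyword (LOCUS, DEFINITION, ...) to its text."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line and not line[0].isspace():
            keyword, _, rest = line.partition(" ")
            current = keyword
            sections[current] = [rest.strip()]
        elif current is not None and line.strip():
            sections[current].append(line.strip())  # continuation line
    return {k: " ".join(p for p in parts if p) for k, parts in sections.items()}

def split_version(version: str) -> tuple:
    """Split 'ACCESSION.N' into the accession and its integer version."""
    accession, _, number = version.split()[0].partition(".")
    return accession, int(number)

def origin_sequence(origin_text: str) -> str:
    """Drop the position numbers and spaces from the ORIGIN block."""
    return "".join(ch for ch in origin_text if ch.isalpha()).upper()

record = parse_flat_file(TOY_RECORD)
print(split_version(record["VERSION"]))   # ('TOY000001', 2)
print(origin_sequence(record["ORIGIN"]))  # ACGTACGTACGT
```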

See a sample sequence record.
