Cleaning Illumina FASTQ reads with Nesoni clip





Introduction

Cleaning FASTQ reads is the process of removing those bits of the reads that you don't deem good enough to be given to the next stage of your pipeline. At worst, this could mean removing the whole read, and if the reads were paired, this means some reads will become "orphan" single reads. The cleaning process is often called filtering, trimming, clipping, or pruning. A FASTQ read has three parts: a sequence ID, a sequence of bases, and a quality score for each base. The cleaning process only has access to this information.
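To make the three parts concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common four-line layout (ID, sequence, "+" separator, quality) with no line wrapping. The names FastqRead and parse_fastq are made up for this illustration and are not part of Nesoni.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRead(NamedTuple):
    read_id: str   # header line without the leading '@'
    sequence: str  # the called bases
    quality: str   # one encoded quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRead]:
    """Yield reads from a FASTQ stream, assuming unwrapped four-line records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record: %r" % header)
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\r\n")
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield FastqRead(header[1:], sequence, quality)


example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
for read in parse_fastq(example):
    print(read.read_id, read.sequence, read.quality)
```

Everything a cleaning tool decides has to be derived from those three fields.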

Sequence

In terms of the sequence itself, we may want to discard reads which contain ambiguous DNA symbols, most commonly "N", which means the base could not be called. Generally, reads with anything other than A, T, G or C are going to cause problems with most downstream analysis tools.
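A check for ambiguous symbols can be very small. The sketch below is only an illustration (the function name is invented here). It treats lower-case bases as acceptable, which is an assumption you may not want.

```python
def has_only_acgt(sequence: str) -> bool:
    """Return True if every symbol in the read is an unambiguous base."""
    # Upper-casing first means soft-masked (lower-case) bases are accepted.
    return set(sequence.upper()) <= set("ACGT")


reads = ["ACGTACGT", "ACGNNCGT", "acgtRYgt"]
kept = [seq for seq in reads if has_only_acgt(seq)]
print(kept)
```

Only the first of the three example reads survives this filter.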

Preparing DNA for sequencing almost always involves ligating "adaptor" sequences to all the fragments of DNA. When things don't go perfectly, these adaptors can end up being sequenced, and be partially present at the start/end of your reads! This can wreak havoc with downstream analyses like read alignment and de novo assembly. This mainly seems to occur with Illumina reads, especially mate-pair libraries and some RNA-Seq protocols. You should always check and remove these adaptor sequences.
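To show the idea of clipping an adaptor from the 3' end, here is a deliberately naive sketch. It uses exact string matching only, whereas real tools tolerate mismatches, so it should not be read as how Nesoni clip does it. The function name, the min_overlap parameter and the adaptor string in the usage lines are all placeholders chosen for this example.

```python
def adaptor_clip_point(sequence: str, adaptor: str, min_overlap: int = 3) -> int:
    """Return how many leading bases to keep after removing a 3' adaptor.

    Handles two cases, both by exact matching:
    1. the whole adaptor occurs somewhere in the read;
    2. the read ends with a prefix of the adaptor (partial adaptor).
    """
    pos = sequence.find(adaptor)
    if pos != -1:
        return pos  # keep everything before the full adaptor
    longest = min(len(adaptor) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adaptor[:overlap]):
            return len(sequence) - overlap
    return len(sequence)  # no adaptor found, keep the whole read


adaptor = "AGATCGGAAGAGC"  # placeholder; use the sequence for your library prep
seq = "ACGTACGTAGATCGGAAG"
qual = "IIIIIIIIIIIIIIIIII"
keep = adaptor_clip_point(seq, adaptor)
# Clip the sequence and the quality string at the same point.
print(seq[:keep], qual[:keep])
```

A small min_overlap removes more true adaptor remnants but also clips more reads that merely happen to end with the same few bases, so the value is a trade-off rather than a fixed rule.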

Quality

Usually when we think of cleaning FASTQ reads, we mean trimming on quality. Each base in a FASTQ read gets a Phred quality score, which is an integer between 0 and 93 (as encoded in the Sanger FASTQ format), although you rarely see more than "Q45" or so. Low numbers are bad, high numbers are good. The numbers are on a logarithmic scale, and represent the probability of this base not being correct: a score Q corresponds to an error probability of 10^(-Q/10). For example, a base of Q30 is expected to be wrong about 1 in 1000 times, or we are 99.9% confident in it. This means that in the millions of Q30 bases throughout all your reads, thousands of them are expected to be wrong! When we get down to Q10, one in ten bases is dodgy. Like any analysis, the "garbage in, garbage out" rule applies, and hence removing low quality reads is often a good strategy, especially if you don't know how well the downstream tool handles errors.
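The arithmetic is easy to check. The sketch below decodes a Sanger-encoded quality string (ASCII code minus 33) and converts Phred scores to error probabilities. The function names are invented for this example, and the 5 million base count is just an illustrative figure.

```python
def decode_sanger_quality(quality: str) -> list:
    """Convert a Sanger-encoded quality string to integer Phred scores."""
    return [ord(char) - 33 for char in quality]


def phred_to_error_probability(q: int) -> float:
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


print(decode_sanger_quality("!5I"))    # lowest score, Q20, Q40
print(phred_to_error_probability(30))  # about 0.001
print(phred_to_error_probability(10))  # about 0.1

# Expected number of wrong bases among 5 million Q30 bases.
print(5_000_000 * phred_to_error_probability(30))
```

Summing the per-base error probabilities across a read gives its expected number of errors, which is another quantity some cleaning tools threshold on.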

In Illumina reads the quality tends to get progressively worse toward the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (say, 3 bases) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality below T. There are many variations on this theme, and later I will discuss the approach we take.
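The three variations just described can be sketched in a few lines each. These are generic illustrations with invented function names, not the approach Nesoni clip takes. Each trimming function returns the number of bases to keep from the 5' end.

```python
def trim_3prime(quals, threshold):
    """Drop bases from the 3' end while the last base is below threshold."""
    keep = len(quals)
    while keep > 0 and quals[keep - 1] < threshold:
        keep -= 1
    return keep


def trim_3prime_windowed(quals, threshold, window=3):
    """Drop bases from the 3' end while the mean of the last `window`
    kept bases is below threshold."""
    keep = len(quals)
    while keep > 0:
        tail = quals[max(0, keep - window):keep]
        if sum(tail) / len(tail) >= threshold:
            break
        keep -= 1
    return keep


def mean_quality(quals):
    """Average Phred score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


quals = [38, 37, 35, 30, 12, 25, 8, 5]
T = 20
print(trim_3prime(quals, T))           # bases kept by per-base trimming
print(trim_3prime_windowed(quals, T))  # bases kept by windowed trimming
print(mean_quality(quals) >= T)        # would the whole read pass?
```

Note that per-base trimming from the end stops at the first good base it meets, so the Q12 base in the middle of this example read survives. Whether that matters depends on the downstream tool.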

ID

You probably think I'm crazy suggesting we can use the sequence ID to filter reads, right? In general, of course, it does not make sense; however, Illumina reads do contain some information in their read IDs. The most relevant is the series of colon-separated integers, some of which encode the coordinates of the cluster on the flowcell lane which generated this sequence. When we got the original Illumina Genome Analyzer (version 1!) it was clearly producing bad reads that came from the edges of the flowcell, due to poorer optical performance and other physical effects. You can still see this effect today when you open a FASTQ file, whereby lots of the reads at the start of the file are full of "N"s and have low quality - these come from the first tile in the corner of the flowcell lane. I'm not advocating using this information in reality, but it is interesting to keep in mind, and may actually be useful if you were given a FASTA file where the quality information had been stripped away.
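For the curious, pulling the cluster position out of a read ID might look like the sketch below. It assumes one of two commonly seen Illumina layouts, where the last four colon-separated fields of the first token are lane, tile, x and y. Other pipelines may rewrite IDs, in which case it returns None. The names ClusterPosition and parse_cluster_position, and the example IDs, are illustrative only.

```python
from typing import NamedTuple, Optional


class ClusterPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_cluster_position(read_id: str) -> Optional[ClusterPosition]:
    """Extract lane, tile, x, y from an Illumina-style read ID, if present."""
    tokens = read_id.lstrip("@").split()
    if not tokens:
        return None
    # Drop any '#index' or '/1' style suffix from the first token.
    name = tokens[0].split("#")[0].split("/")[0]
    fields = name.split(":")
    if len(fields) < 5:
        return None
    try:
        lane, tile, x, y = (int(field) for field in fields[-4:])
    except ValueError:
        return None  # not the layout assumed here
    return ClusterPosition(lane, tile, x, y)


print(parse_cluster_position("@HWUSI-EAS100R:6:73:941:1973#0/1"))
print(parse_cluster_position("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
print(parse_cluster_position("@some_renamed_read_42"))
```

With the tile and coordinates in hand, you could flag reads from edge tiles, though as noted above this is more of a curiosity than a recommended filter.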

What is Nesoni?

Nesoni (on GitHub) is our Swiss Army knife of NGS-related tools, implemented primarily by Paul Harrison in our group. It began as a wrapper for aligning reads and calling SNPs for the hundreds of bacterial genome samples we were processing, but has now evolved into an extensible, distributable pipeline system which we hope to publish soon and actually document. I'll save the full description of Nesoni for another day, and today just focus on one of the simplest but still very useful tools it has.