A Computation Note for Assembling Plasmodium 3D7 with CLEAR, Part I



Title: A Computation Note for Assembling Plasmodium 3D7 with CLEAR, Part I
Author: Jason Chin
Date: May 31 2013
Update: June 4 2013

What Do You Need To Reproduce The Assembly Shown Here?

  • data: pread.fa and pr_pr_strigent.m4
  • python 2.7 / IPython 0.13.2
  • pbcore from https://github.com/PacificBiosciences/pbcore
  • optional: summarizeAssembly.py from PBJelly_12.7.25 installed in ~/bin/PBJelly_12.7.25/
  • optional: nucmer and mummerplot from mummer3 (http://mummer.sourceforge.net/)
  • optional data: reference PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta from PlasmoDB (http://plasmodb.org/plasmo/)

Introduction

This is a brief note, with code, showing how to assemble the Plasmodium 3D7 (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=36329) genome from PacBio(R) pre-assembled reads using a "Consistent Long-read Evidence Assembling pRocess (CLEAR)".

Why Do We Sequence Plasmodium 3D7 for This Assembly Example?

Plasmodium is the parasite that causes malaria, and understanding its genetics will help in finding a cure for the disease. From a sequencing technology point of view, the genome is a great challenge to sequence and assemble. Because of its very imbalanced AT/GC content (AT ~= 80% and GC ~= 20%), most sequencing technologies cannot produce reads that are good and long enough to assemble the genome into long contigs. For example, an earlier publication using 2nd generation sequencing technology obtained a contig N50 of only about 1 to 4 kbp (BMC Genomics. 2011; 12: 116, http://www.biomedcentral.com/1471-2164/12/116). (See other related assembly statistics at http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/AssemblyStats.html) Sanger sequencing gives better results, with a contig N50 of about 10 to 20 kb. Here we demonstrate that with PacBio(R) RS Single Molecule Real-Time (SMRT(R)) sequencing technology, we can easily assemble the genome with much better results (N50 ~= 954 kb, about 43x of the) than the earlier 2nd gen. sequencing results, even with some simple home-made assembly code.
We choose the 3D7 strain because its DNA is available and because it is the only strain with a good finished reference against which we can compare our results (see also http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/GenomesIndex.html). We expect similar performance for other strains of Plasmodium.
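
For readers less familiar with the two quantities quoted above, here is a minimal sketch of how a contig N50 and an AT fraction can be computed. It is not part of the CLEAR code. The function names n50 and at_fraction and the toy inputs are made up for illustration, and N50 is taken in its usual sense: the length L such that contigs of length L or longer contain at least half of the assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Walk the contigs from longest to shortest and return the length
    of the contig at which the running sum first reaches half of the
    total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0


def at_fraction(seq):
    """Return the fraction of called bases (A, C, G, T) that are A or T.

    Ambiguous bases such as N are ignored. Returns 0.0 when the
    sequence contains no called bases.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("A") + seq.count("T")) / float(called)


# Toy inputs, not real assembly data.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))        # 5
print(at_fraction("AATTGC"))   # 4 of the 6 bases are A or T
```

In practice the contig lengths would come from the assembled FASTA file rather than from a hand-written list.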