FASTG: An expressive representation for genome assemblies

Introduction

FASTG is a format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. It is called FASTG, like FASTA, but the G stands for ‘graph’.
Currently, genome assemblies are represented linearly, as sequences of bases recorded in FASTA files. Since chromosomes are in fact linear or circular, this makes sense so long as one has complete knowledge of the genome. However, almost all assemblies contain errors and omissions, which can result in incorrect biological inferences. Moreover, in most cases these assemblies do not represent polymorphism at all.
Today, using high-coverage data, assembly algorithms 'see' almost all bases of the genome. Thus errors in the assemblies result primarily from defects in the algorithms and defects in assembly representation. Indeed, where a particular locus in an assembly is wrong, it is generally the case that the assembly algorithm could have prevented the error by emitting an ambiguous call. However, such ambiguities are precluded by the current linear representation. Similarly, complex polymorphisms cannot easily be represented, and simple polymorphisms must be captured in a supporting file.
Just as physical measurements come with error bars, so should genome assemblies come with structures that capture the uncertainties in our knowledge. At its heart, FASTG is FASTA, which allows existing tools to run and provides coordinates that facilitate computation. On top of this are global and local layers of markup.

Specification

The current version of the FASTG Specification is available for download.
Here are the toy genome FASTG and FASTA files described in section 2.