Trimming 454 reads at the 5' end may improve fidelity


I'm a bioinformatician at the University of Akron. I discuss my thoughts and bioinformatics research on my blog Bioinformatics Zen. You can contact me by email at [email protected] or on twitter @bioinformatics.Read more

A previous post described training machine learning classifiers to filter inaccurate reads in 16S-based metagenomic studies. This analysis was based on available mock community data - where an known set of 16S genes is sequenced and the generated reads compared to the original sequences. The value of using a mock community is that knowing the original sequences allows you to identify the types of errors you are observing in the output data. The method I previously used to find the accurate reads using the mock community data was as follows:
  1. Remove barcodes and primers from reads and trim down to 200bp.
  2. Select the unique representative reads and count how many times each unique read is observed in the data.
  3. Build an alignment from N*2 of the most abundant unique reads. N is the number of sequences used in the mock community data.
  4. Use single linkage clustering to group the aligned unique reads to within 4bp differences of each other. This merges the unique read that matches to with 4bp of another unique read into the more abundant read group.
The result of this method should be N unique reads in the output data and each unique read should then correspond to one of the mock community sequences. This approach assumes that after removing up to 4 single base errors and indels, the remaining most abundant unique read for each mock community sequence should be the correct one. This assumption can be described another way: any novel sequences generated from infidelity in the PCR or sequencing steps should not produce more copies than that of the original correct sequence.
I built this workflow with a combination of mothur and shell scripting. There is a gist on github illustrating this method of identifying correct reads based on abundance.
This approach bothered me because it does not actually compare the reads that I specify as correct against the original mock community. Therefore before pursing further machine learning approaches I aimed to first confirm the reads identified as accurate exactly match the mock community sequences.

Read fidelity in Quince et. al 2009 data

I used a simple approach to identify accurate reads in the 454 data: trim each mock community sequence to a 100bp substring and see how many reads contained this substring. I started with the divergent mock community data from Quince et. al 2009. This is the same data I used for machine learning approach to filtering inaccurate reads. I trimmed the community sequences to 100bp beginning at 5’ end and searched the generated read data for exact matching substrings using grep. The following code illustrates this approach with the resulting number of matches.
fastx_trimmer -l 100 -i -o trim100.fasta
grep --invert-match '>' trim100.fasta \
  | sort | uniq \
  | parallel 'grep {} SRX002564.fasta | wc -l' \
  | sort -r -n \
  | tr "\\n" " "
#=> 1609 1543 1024 14 5 5 4 4 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0
This shows a large variance in the number of reads matching a mock community sequence substring. There are 3 with >1000 matching reads while the remainder of match <20 reads. I double-checked this result by examining a couple of the community sequence reads individually. This is illustrated below where the community sequence is shown in the top row and generated reads after removal of the primer are shown in the rows below.
>8                      CTA-ACGATGA-TACTCGAT
>SRR013437.14   tccacacc...A.......A........
>SRR013437.627  tccacacc...A.......A........
>SRR013437.935  tccacgcc...G.......A........
>SRR013437.2322 tcctcgcc...A.......A........
>SRR013437.2423 cccacgcc...A.......A........
>72                  GCCGTAA-CGATGAGAACTAGCC
>SRR013437.245  tccac.......A...............
>SRR013437.351  tccac.......A...............
>SRR013437.433  tccac.......A...............
>SRR013437.542  tccac.......A...............
>SRR013437.661  tccac.......A...............
This highlights that, for two of these mock community sequences, there appear to be insertions in the 5’ ends of the most abundant reads. These differences prevent exact substring matches to the mock community sequence. I considered if this was the case for the for further downstream sequence by repeating this process with a 100bp substring 50bp downstream from the 5’ end.
fastx_trimmer -f 50 -l 150 \
  -i \
  -o trim50_150.fasta
grep --invert-match '>' trim50_150.fasta \
  | sort | uniq \
  | parallel 'grep {} SRX002564.fasta | wc -l' \
  | sort -r -n \
  | tr "\\n" " "
#=> 2273 2229 1943 1873 1802 1795 1774 1701 1692 1677 1670
#   1628 1596 1573 1545 1522 1446 1442 1399 1356 935 635 548
This shows that using a substring 50bp further downstream finds a larger number of exact matches in the reads. The following figure compares the difference in the substring matches. The figure is on a logarithmic scale.